
S08 - Principles of Deep Learning I

Principles of Deep Learning I
Itthi Chatnuntawech
Nanoinformatics and Artificial Intelligence (NAI)
November 14, 2022
Outline
❖ Deep Learning Components
  ❖ Fully connected layer
  ❖ Computation graph
  ❖ Activation functions
  ❖ Loss function
  ❖ Model optimization
    ❖ Gradient descent
    ❖ Backpropagation
    ❖ Stochastic gradient descent (SGD)
  ❖ Regularization
    ❖ ℓ2 and ℓ1 regularization
    ❖ Dropout
    ❖ Batch normalization
  ❖ Convolutional layer
  ❖ Pooling layer
❖ Convolutional Neural Network (CNN)
2
From Last Time: Image Classification Using Deep Learning
❖ Training phase: Optimize the weights of a deep neural network
[Figure: prepared inputs are fed to the network; estimated outputs (e.g., 0.7, 0.2, 0.1, 0.9) are compared against prepared outputs (1, 0, 0, 1); training and validation losses decrease over iterations.]
❖ Test phase: Perform prediction using the trained neural network
[Figure: the trained network predicts "orange!" for a test image.]
3
Deep Learning components
❖ Fully connected layer (Dense)
❖ Convolutional layer: Conv1D, 2D, 3D, …; separable Conv
❖ Pooling layer: max-pooling, average-pooling
❖ Activation function: sigmoid, softmax, ReLU, swish (SiLU)
❖ Loss function: categorical crossentropy, binary crossentropy, mean squared error, mean absolute error
❖ Optimizer: SGD, Adam, RMSprop
❖ Regularization: dropout, data augmentation, l1/l2 regularizations
❖ Evaluation metric: accuracy, F1-score, AUC, confusion matrix
Combine basic components to build a neural network
• More components → "more" representative power
4
Artificial Neuron
Single neuron: given inputs x1, x2, …, xN, weights w1, w2, …, wN, and a bias weight w0 (attached to a constant input of 1),

y = a( Σ_{n=1}^{N} x_n w_n + w_0 )

where a is a non-linear function (the activation).
Different weights give rise to different behavior:
• If we set all the weights to 0, then y = a(0), a constant that ignores the input.
• If we set w1 = 1 and the rest to 0, then y = a(x1).
• If we set w0 = 0 and the rest to 1, then y = a(x1 + x2 + … + xN).
The Essence of Artificial Neural Networks by Ivan Liljeqvist (Medium)
5
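As a minimal Python/NumPy sketch (not from the slides), the single-neuron computation above can be written in a few lines; the sigmoid activation and the example numbers are assumptions chosen for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, activation=sigmoid):
    # y = a( sum_n x_n * w_n + w_0 )
    return activation(np.dot(x, w) + w0)

x = np.array([0.5, -1.0, 2.0])   # N = 3 inputs
w = np.array([1.0, 0.0, 0.0])    # w1 = 1, the rest 0  ->  y = a(x1)
print(neuron(x, w, w0=0.0))      # same as sigmoid(0.5)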
keyword: Multi-Layer Perceptron (MLP)
Fully Connected Layer (Dense)
Each neuron computes a(wᵀx).
[Figure: a 4-dimensional input x feeds hidden layer 1 (neurons A through E) through the weight matrix W1; later layers use W2 and W3 to produce the output y.]
Hidden layer 1 (ignore the biases for simplicity):

[out_A, out_B, out_C, out_D, out_E]ᵀ = a1(W1 x), where

W1 = [ w_{A,1}  w_{A,2}  w_{A,3}  w_{A,4}
       w_{B,1}  w_{B,2}  w_{B,3}  w_{B,4}
       w_{C,1}  w_{C,2}  w_{C,3}  w_{C,4}
       w_{D,1}  w_{D,2}  w_{D,3}  w_{D,4}
       w_{E,1}  w_{E,2}  w_{E,3}  w_{E,4} ]   and   x = [x1, x2, x3, x4]ᵀ
y = a3(W3a2(W2a1(W1x)))
6
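A minimal NumPy sketch (not from the slides) of the full forward pass y = a3(W3 a2(W2 a1(W1 x))); the layer sizes follow the computation-graph example on the next slide, and the ReLU/softmax activations and random weights are assumptions for illustration.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # 4-dimensional input
W1 = rng.normal(size=(5, 4))      # hidden layer 1: 5 nodes
W2 = rng.normal(size=(7, 5))      # hidden layer 2: 7 nodes
W3 = rng.normal(size=(3, 7))      # output layer: 3 nodes

y = softmax(W3 @ relu(W2 @ relu(W1 @ x)))   # y = a3(W3 a2(W2 a1(W1 x)))
print(y, y.sum())                            # three probabilities summing to 1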
keyword: Multi-Layer Perceptron (MLP)
Representing Neural Network
y = a3(W3a2(W2a1(W1x)))
[Figure: the same network drawn with weight matrices W1, W2, W3 between the input x and the output y.]
Computation graph:
x → (× W1) → a1 → (× W2) → a2 → (× W3) → a3 → y
Fully connected 1 (5 nodes, a1) → Fully connected 2 (7 nodes, a2) → Fully connected 3 (3 nodes, a3)
7
Activation Functions
Introduction to Exponential Linear Unit
8
Two-class Classification (class 1 vs. class 2)
[Figure: the input passes through Fully connected 1 (3 nodes, ReLU) → Fully connected 2 (2 nodes, ReLU) → Fully connected 3 (1 node, sigmoid); the sigmoid output lies between 0 and 1 (e.g., 0.99) and is interpreted as the probability of being class 1.]
model.add(tf.keras.layers.Dense(3, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
9
Three-class Classification (class 1, class 2, class 3)
[Figure: Fully connected 1 (3 nodes, ReLU) → Fully connected 2 (2 nodes, ReLU) → Fully connected 3 (3 nodes, sigmoid); each sigmoid output lies between 0 and 1 and is read as the probability of being class 1, 2, or 3.]
What if our model outputs something like (0.99, 0.99, 0.99) or (0.99, 0.1, 0.02)? The three "probabilities" need not sum to 1, so they do not form a proper distribution over the classes. Not good.
10
Three-class Classification (class 1, class 2, class 3)
[Figure: the last fully connected layer now uses a softmax activation, so its three outputs are the probabilities of being class 1, class 2, and class 3.]
Softmax function:

σ(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}     (K = 3 in this case)

For z1 = 5, z2 = 2, z3 = 1:
e^{z1} / (e^{z1} + e^{z2} + e^{z3}) = e⁵ / (e⁵ + e² + e¹) = 0.936
e^{z2} / (e^{z1} + e^{z2} + e^{z3}) = e² / (e⁵ + e² + e¹) = 0.047
e^{z3} / (e^{z1} + e^{z2} + e^{z3}) = e¹ / (e⁵ + e² + e¹) = 0.017
0.936 + 0.047 + 0.017 = 1
11
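A short NumPy check (not from the slides) of the softmax computation above for z = (5, 2, 1).

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([5.0, 2.0, 1.0])
p = softmax(z)
print(np.round(p, 3))   # [0.936 0.047 0.017]
print(p.sum())          # the probabilities sum to 1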
keywords: multi-class classification ≠ multi-label classification
Extend the model to K-class Classification
[Figure: Fully connected 1 (3 nodes, ReLU) → Fully connected 2 (2 nodes, ReLU) → Fully connected 3 (K nodes, softmax), whose outputs are P(class 1), P(class 2), …, P(class K).]
model.add(tf.keras.layers.Dense(3, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='relu'))
model.add(tf.keras.layers.Dense(K, activation='softmax'))
12
[Recap of the Deep Learning component map from slide 4: fully connected, convolutional, and pooling layers; activation functions; loss functions; optimizers; regularization; evaluation metrics. Combine basic components to build a neural network; more components → "more" representative power.]
13
Image Classification Using Deep Learning
❖ Training phase: Optimize the weights of a deep neural network
[Figure: prepared inputs, prepared outputs, estimated outputs, and training/validation loss curves, as on slide 3.]
❖ Test phase: Perform prediction using the trained neural network ("orange!")
14
Loss function
(also called cost function or error function)
❖ A function L that maps values of variable(s) onto a real number
  For example, L(y, ŷ) gives us a number that tells us how "different" y and ŷ are
❖ We have used it to monitor how our model performs during the training phase
  [Figure: training and validation loss curves decreasing over iterations.]
  Mean squared error: L(y, ŷ) = (1/N) Σ_{i=1}^{N} (yi − ŷi)²
  Mean absolute error: L(y, ŷ) = (1/N) Σ_{i=1}^{N} |yi − ŷi|
❖ Mean squared error (MSE) and mean absolute error (MAE) are two popular choices for regression
❖ Cross-entropy is the most common choice for a classification task
15
Cross-entropy
❖ Well suited to comparing probability distributions
3-class classification: the last deep learning layer is a fully connected layer (3 nodes, softmax)

L(y, ŷ) = − Σ_{k=1}^{K} y_k log(ŷ_k)     by definition (K = 3 here)

Predicted prob. dist. ŷ = [0.7, 0.1, 0.2]; true prob. dist. (one-hot encoding) y = [1, 0, 0]

L(y, ŷ) = −(1 × log(0.7) + 0 × log(0.1) + 0 × log(0.2)) = −1 × log(0.7) = 0.15     (log base 10 on this slide)

Assume that we have updated the weights of our model and got ŷ = [0.9, 0.1, 0];
then L(y, ŷ) = −1 × log(0.9) + 0 + 0 = 0.046.
Lower loss: ŷ has become more similar to y than in the previous case. Our model performs better!
16
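A small Python check (not from the slides) of the two cross-entropy values above. Note that the slide's numbers use log base 10, while libraries such as Keras use the natural log; the comparison between the two predictions is unaffected.

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y, y_hat) = -sum_k y_k * log10(y_hat_k), matching the slide's base-10 numbers
    return -np.sum(y * np.log10(np.clip(y_hat, eps, 1.0)))

y = np.array([1.0, 0.0, 0.0])                                   # one-hot true distribution
print(round(cross_entropy(y, np.array([0.7, 0.1, 0.2])), 3))    # 0.155
print(round(cross_entropy(y, np.array([0.9, 0.1, 0.0])), 3))    # 0.046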
Optimizing Our Model
❖ To train our model fW, we adjust its weights W in a way that decreases the loss function
❖ Specifically, we are minimizing the loss function with respect to W

L(y, ŷ) = L(y, fW(x))

where y is the prepared output (labels/targets), x is the prepared input, fW(x) is the predicted output, and W are the weights.
Calculus! Compute the derivative of L(y, fW(x)) with respect to W and set it to 0.
17
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
Method 1: Compute the derivative (gradient) and set it to 0:

dL(w)/dw = d(w²)/dw = 2w = 0   ⟹   w = 0

In our case, we have L(y, fW(x)), and it is not simple to compute the gradient with respect to W. Even when we have a way to compute the gradient and manage to set it to zero, we might not be able to get a closed-form solution.
18
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
Method 2: Iteratively solve for w
• Randomly start somewhere
• Gradually move to the minimum of the function
Gradient Descent for Logistic Regression Simplified – Step by Step Visual Guide
19
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
[Figure: the slope dL(w)/dw is positive to the right of the minimum (positive slope → move left to decrease the function), zero at the minimum, and negative to the left (negative slope → move right to decrease the function).]

Method 2: Iteratively solve for w (gradient descent)
1. Guess an initial solution w(0) and pick α (the step size / learning rate α indicates how far we want to move)
   For k = 0, 1, 2, …
2. Compute dL(w)/dw |_{w=w(k)} = 2w |_{w=w(k)} = 2w(k)
3. Compute a new solution w(k+1) := w(k) − α dL(w)/dw |_{w=w(k)} = w(k) − α · 2w(k)
4. Repeat steps 2 and 3 until convergence

The negative of the gradient, −dL(w)/dw, tells us the direction we should move in to decrease the function.
20
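A minimal Python sketch (not from the slides) of this gradient-descent loop for L(w) = w²; try the α values used on the following slides (0.5, 0.25, 1) to reproduce their behaviour.

def gradient_descent(w0, alpha, num_iters=5):
    w = w0
    history = [w]
    for k in range(num_iters):
        grad = 2 * w              # dL/dw for L(w) = w^2
        w = w - alpha * grad      # w(k+1) = w(k) - alpha * dL/dw
        history.append(w)
    return history

print(gradient_descent(w0=-2.0, alpha=0.5))    # [-2.0, 0.0, 0.0, ...]  converges in one step
print(gradient_descent(w0=2.0, alpha=0.25))    # [2.0, 1.0, 0.5, 0.25, ...]
print(gradient_descent(w0=2.0, alpha=1.0))     # [2.0, -2.0, 2.0, ...]  oscillates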
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
Method 2: Iteratively solve for w, with w(0) = −2, α = 0.5:
w(1) = w(0) − α · 2w(0) = −2 − 0.5 · 2 · (−2) = 0
w(2) = w(1) − α · 2w(1) = 0
w(3) = w(2) − α · 2w(2) = 0
(Same update rule as before: w(k+1) := w(k) − α · 2w(k).)
21
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
With w(0) = 2, α = 0.5:
w(1) = w(0) − α · 2w(0) = 2 − 0.5 · 2 · 2 = 0
w(2) = w(1) − α · 2w(1) = 0
w(3) = w(2) − α · 2w(2) = 0
22
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
With w(0) = 2, α = 0.25 (a smaller step size converges more slowly):
w(1) = w(0) − α · 2w(0) = 2 − 0.25 · 2 · 2 = 1
w(2) = w(1) − α · 2w(1) = 1 − 0.25 · 2 · 1 = 0.5
w(3) = w(2) − α · 2w(2) = 0.5 − 0.25 · 2 · 0.5 = 0.25
w(4) = w(3) − α · 2w(3) = 0.125
w(5) = w(4) − α · 2w(4) = 0.0625
23
keyword: gradient descent
Example: Optimization
❖ Goal: Find w that minimizes the following loss function: L(w) = w²
With w(0) = 2, α = 1 (α: learning rate; here the step is too large, so the iterates oscillate and never converge):
w(1) = w(0) − α · 2w(0) = 2 − 1 · 2 · 2 = −2
w(2) = w(1) − α · 2w(1) = −2 − 1 · 2 · (−2) = 2
w(3) = w(2) − α · 2w(2) = −2
w(4) = w(3) − α · 2w(3) = 2
w(5) = w(4) − α · 2w(4) = −2
24
keywords: local minimum/maximum,
global minimum/maximum
Example: Gradient
Deep Learning, NeuroEvolution & Extreme Learning Machines
25
Optimization: Learning Rate
Fixed learning rate
Adaptive learning rate
Deep Learning, NeuroEvolution & Extreme Learning Machines
26
Gradient
1-dimensional case: L(w1) = w1²
Derivative: dL(w1)/dw1 = d(w1²)/dw1 = 2w1
Weight update: w1,(k+1) := w1,(k) − α dL(w1)/dw1 |_{w1=w1,(k)} = w1,(k) − α · 2w1,(k)

2-dimensional case: L(w1, w2) = w1² + w2²
Partial derivatives:
∂L(w1, w2)/∂w1 = ∂(w1² + w2²)/∂w1 = 2w1
∂L(w1, w2)/∂w2 = ∂(w1² + w2²)/∂w2 = 2w2
Gradient: ∇_w L(w1, w2) = [∂L(w1, w2)/∂w1, ∂L(w1, w2)/∂w2]ᵀ
Weight update:
[w1,(k+1), w2,(k+1)]ᵀ = [w1,(k), w2,(k)]ᵀ − α [∂L/∂w1 |_{w1=w1,(k)}, ∂L/∂w2 |_{w2=w2,(k)}]ᵀ
                      = [w1,(k), w2,(k)]ᵀ − α [2w1,(k), 2w2,(k)]ᵀ
27
Optimization in Real Life
[Figure: a simple convex loss surface over (w1, w2) has no local-minima problem (the global minimum is the only local minimum), whereas a realistic loss surface has many local minima where optimization could get stuck. Image: O'Reilly Media.]
Furthermore, how do we compute the gradient of a complicated loss function such as the one in our example, L(y, fW(x))? Backpropagation!
28
Backpropagation
Computation graph:
w1, w2 → (+) → z = w1 + w2;   z, w3 → (×) → L
L(w1, w2, w3) = z × w3 = (w1 + w2) × w3

Exercise: Compute the gradient vector (use the chain rule)
∂L/∂w1 = (∂L/∂z)(∂z/∂w1) = ∂(w3 z)/∂z × ∂(w1 + w2)/∂w1 = w3 × (1 + 0) = w3
∂L/∂w2 = (∂L/∂z)(∂z/∂w2) = ∂(w3 z)/∂z × ∂(w1 + w2)/∂w2 = w3 × (0 + 1) = w3
∂L/∂w3 = ∂(z w3)/∂w3 = z = w1 + w2

∇_w L(w1, w2, w3) = [w3, w3, w1 + w2]ᵀ

Modified from CS231n Convolutional Neural Networks for Visual Recognition
29
Backpropagate
Annotate the computation graph with local gradients: at the (+) node, ∂z/∂w1 = 1 and ∂z/∂w2 = 1; at the (×) node, ∂L/∂z = w3 and ∂L/∂w3 = z = w1 + w2, where z = w1 + w2 and L = z × w3.

Exercise: Compute the gradient vector
∂L/∂w1 = (∂L/∂z)(∂z/∂w1) = w3 × 1 = w3
∂L/∂w2 = (∂L/∂z)(∂z/∂w2) = w3 × 1 = w3
∂L/∂w3 = z = w1 + w2

∇_w L(w1, w2, w3) = [w3, w3, w1 + w2]ᵀ

Modified from CS231n Convolutional Neural Networks for Visual Recognition
30
With actual numbers
w1 = −2, w2 = 5, w3 = −4
Forward pass: we calculate the values of the variables in the computation graph
z = −2 + 5 = 3
L = 3 × (−4) = −12
Modified from CS231n Convolutional Neural Networks for Visual Recognition
33
With actual numbers
Forward pass (we calculate the values of the variables in the computation graph):
z = −2 + 5 = 3
L = 3 × (−4) = −12
Backward pass (we calculate the gradients and evaluate them using the values from the forward pass):
∂L/∂z = w3 = −4
∂L/∂w1 = (∂L/∂z)(∂z/∂w1) = −4 × 1 = −4
∂L/∂w2 = (∂L/∂z)(∂z/∂w2) = −4 × 1 = −4
∂L/∂w3 = z = 3

∇_w L = [−4, −4, 3]ᵀ

Modified from CS231n Convolutional Neural Networks for Visual Recognition
34
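A small TensorFlow sketch (not from the slides) that reproduces the forward and backward passes above with automatic differentiation (tf.GradientTape).

import tensorflow as tf

w1 = tf.Variable(-2.0)
w2 = tf.Variable(5.0)
w3 = tf.Variable(-4.0)

with tf.GradientTape() as tape:
    z = w1 + w2          # forward pass: z = 3
    L = z * w3           # forward pass: L = -12

grads = tape.gradient(L, [w1, w2, w3])
print([g.numpy() for g in grads])   # [-4.0, -4.0, 3.0], matching the backward pass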
Update the weights
Given w(0) = [−2, 5, −4]ᵀ, ∇_w L = [−4, −4, 3]ᵀ evaluated at w(0), α = 1, and L(w(0)) = −12.
Weight update:
w(1) = w(0) − α ∇_w L = [−2, 5, −4]ᵀ − 1 × [−4, −4, 3]ᵀ = [2, 9, −7]ᵀ
L(w(1)) = (2 + 9) × (−7) = −77
w(2) = w(1) − α ∇_w L = ?     L(w(2)) = ?
Modified from CS231n Convolutional Neural Networks for Visual Recognition
35
Backpropagation
f(w, x) = 1 / (1 + e^{−(w0 x0 + w1 x1 + w2)})
Taken from CS231n Convolutional Neural Networks for Visual Recognition
36
Exploding/Vanishing Gradients
0.05⁵ = 3.125 × 10⁻⁷ → very small (vanishing)
15⁵ = 759,375 → very large (exploding)
Taken from Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients
37
Training Our Model
[Figure: prepared inputs x1, x2, …, xN pass through the deep learning model fW to produce predicted outputs ŷ1, ŷ2, …, ŷN, which are compared with the prepared outputs y1, y2, …, yN through the per-sample loss values L(y1, ŷ1), L(y2, ŷ2), …, L(yN, ŷN).]

Total loss: L(y, ŷ) = Σ_{i=1}^{N} L(yi, ŷi) = Σ_{i=1}^{N} L(yi, fW(xi))

Update the weights:
Compute the gradient of the total loss w.r.t. W using backpropagation: ∇_W L = Σ_{i=1}^{N} ∇_W L(yi, fW(xi))
Assign the new weights: W ← W − α ∇_W L
38
Stochastic Gradient Descent
❖ Every time we want to update W, we need to compute the loss functions and their gradients for all samples:

∇_W L = Σ_{i=1}^{N} ∇_W L(yi, fW(xi))

❖ For very large N (14,000,000 images?), we might not be able to do this because of both time and memory constraints
❖ In stochastic gradient descent (SGD), we update W based on subsets of samples called mini-batches

Training Data → batch 1, batch 2, batch 3, …, batch B
1 epoch = the model has seen the whole training data once (i.e., B batches in this case)
In this example, the weight matrix W is updated B times per epoch.
39
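A concrete mini-batch SGD sketch (not from the slides), using a toy linear-regression model fW(x) = Wx with an MSE loss so the gradient fits on one line; as on the slide, the weights are updated once per mini-batch, i.e., B times per epoch. The data, batch size, and learning rate are arbitrary example values.

import numpy as np

rng = np.random.default_rng(0)
N, D, batch_size, alpha = 1000, 5, 32, 0.1
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w                       # toy regression targets

w = np.zeros(D)                      # model weights to learn
for epoch in range(10):
    perm = rng.permutation(N)        # shuffle, then split into mini-batches
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of the MSE on this mini-batch
        w = w - alpha * grad                         # W <- W - alpha * gradient
print(np.allclose(w, true_w, atol=1e-2))             # True: SGD recovers the weights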
Optimizers
❖ Popular choices are
  ❖ Stochastic gradient descent (SGD)
  ❖ SGD with momentum
  ❖ RMSprop
  ❖ Adam
  ❖ ADADELTA
  ❖ AdaGrad
[Figure: trajectories of different optimizers over a loss landscape (bright = low, dark = high) in the (w1, w2) plane.]
Interactive Visualization of Optimization Algorithms in Deep Learning
40
[Figure: optimizer trajectories over another loss landscape (bright = low, dark = high) in the (w1, w2) plane.]
Interactive Visualization of Optimization Algorithms in Deep Learning
41
[Recap of the Deep Learning component map from slide 4: fully connected, convolutional, and pooling layers; activation functions; loss functions; optimizers; regularization; evaluation metrics. Combine basic components to build a neural network; more components → "more" representative power.]
42
[Figure: input x → Dense(3, ReLU) → Dense(2, ReLU) → Dense(3, softmax) → probabilities of being class 1, class 2, and class 3; training and validation losses decrease over iterations.]

# Import necessary modules
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Create the model
model = keras.Sequential()
model.add(layers.Dense(3, activation="relu"))
model.add(layers.Dense(2, activation="relu"))
model.add(layers.Dense(3, activation="softmax"))

# Compile the model (loss function and optimizer)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(),
)
# Train the model for 100 epochs with a batch size of 32
model.fit(x_train, y_train, batch_size=32, epochs=100, validation_data=(x_val,y_val))
43
Regularization
[Figure: loss curves illustrating overfitting; the training loss keeps decreasing while the validation loss starts to increase after some iterations.]
❖ Regularization is frequently used to mitigate overfitting
❖ Add a regularization term to the loss function
  ❖ ℓ2 regularization: min Σ_{i=1}^{N} L(yi, fW(xi))  →  min Σ_{i=1}^{N} L(yi, fW(xi)) + λ∥w∥₂
  ❖ ℓ1 regularization: min Σ_{i=1}^{N} L(yi, fW(xi))  →  min Σ_{i=1}^{N} L(yi, fW(xi)) + λ∥w∥₁
❖ Use dropout - much more popular
44
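A hedged Keras sketch (not from the slides) of adding an ℓ2 or ℓ1 weight penalty to a Dense layer via kernel_regularizer; λ = 0.01 and the layer sizes are arbitrary example values. Note that Keras's l2 regularizer adds λ · Σ w² (the squared norm) to the loss.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),   # l2 weight penalty on this layer
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l1(0.01)),   # l1 weight penalty on this layer
    layers.Dense(3, activation="softmax"),
])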
Dropout   (keyword: DropBlock for CNN)
❖ Dropout can be used to reduce overfitting
❖ Randomly omit some neurons/units during training
Dropout for Deep Learning Regularization, explained with Examples
45
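A minimal Keras sketch (not from the slides) of inserting dropout between fully connected layers; the rate 0.5 means each unit is dropped with probability 0.5 during training, and Keras disables dropout automatically at test time.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),        # randomly omit 50% of the units during training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
])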
Dropout
[Figure: training/test loss and training/test accuracy curves, with and without dropout.]
Dropout for Deep Learning Regularization, explained with Examples
46
Batch Normalization
❖ Normalize every batch
  ❖ reduces internal covariate shift problems*
  ❖ can be thought of as a form of implicit regularization
  ❖ can help reduce overfitting
[Figure: histograms (counts vs. values) of activations before and after normalization.]
Batch Normalization — Speed up Neural Network Training
*This has been challenged by more recent work
47
keywords: layer normalization, instance normalization, group normalization, standardization
Batch Normalization
❖ Reduces internal covariate shift problems*
❖ More robust to bad initialization
[Figure: a mini-batch of activations arranged as (batch size × # of features) is normalized per feature over the batch.]
Lecture 7 (Stanford, CS231n, 2018)
*This has been challenged by more recent work
49
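A hedged Keras sketch (not from the slides) of placing a batch-normalization layer after a dense layer; the layer sizes are arbitrary example values.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64),
    layers.BatchNormalization(),   # normalize the activations over the batch dimension
    layers.Activation("relu"),
    layers.Dense(3, activation="softmax"),
])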
Data Augmentation
Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search
50
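A hedged Keras sketch (not from the slides) of simple image data augmentation using the preprocessing layers available in recent TensorFlow/Keras versions; the flip/rotation/zoom settings are arbitrary example values.

import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),     # random horizontal flips
    layers.RandomRotation(0.1),          # random rotations up to +/- 10% of a full turn
    layers.RandomZoom(0.1),              # random zoom in/out by up to 10%
])
# augmented = data_augmentation(images, training=True)  # apply during training only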
[Recap of the Deep Learning component map from slide 4: fully connected, convolutional, and pooling layers; activation functions; loss functions; optimizers; regularization; evaluation metrics. Combine basic components to build a neural network; more components → "more" representative power.]
51
W1
[Figure: a 4-dimensional input x = (x1, x2, x3, x4) is fully connected to hidden layer 1 (neurons A through E); biases are ignored for simplicity. The outputs are (out_A, …, out_E) = a1(W1 x), where W1 is the 5 × 4 matrix of weights w_{A,1}, …, w_{E,4}.]
# of weights for hidden layer 1 = 4 × 5 = 20
# of weights per layer = # of incoming nodes × # of hidden nodes
What if our input dimension is 256 × 256 = 65,536 and we use 1024 hidden nodes?
65,536 × 1024 ≈ 67 M trainable parameters for just one layer!
52
Do we need that many parameters?
Would it be the case that
❖ only neighboring neurons talk to each other?
❖ furthermore, they talk to their neighbors in the same way?
[Figure: in a fully connected layer, every one of the ~1M nodes in between has its own incoming weights (w_{g,1}, w_{g,2}, w_{g,3}, …, w_{o,1}, w_{o,2}, w_{o,3}); if instead each node connects only to its neighbors and all nodes share the same small set of weights w1, w2, w3, we get much fewer parameters!]
53
2D Convolution
❖ Similarly, for images, maybe only neighboring pixels talk to each other?
2D convolution
2D pooling
Deep learning radiology: the secret of convolutional neural networks
54
2D Convolution

y[n1, n2] = x[n1, n2] * h[n1, n2] = Σ_{k1=−∞}^{∞} Σ_{k2=−∞}^{∞} x[k1, k2] h[n1 − k1, n2 − k2]

Input x (5 × 5):
 0  2  1  1  0
 0  1  1  1  0
−3 −1  1  0  1
 0  6  0  0  1
 0  4  1  7  0

Filter/kernel h (3 × 3), which defines the relationship between neighboring pixels:
 0  1  0
 1  1  1
 0  1  0

First output value (kernel over the top-left 3 × 3 patch of x):
0×0 + 2×1 + 1×0 + 0×1 + 1×1 + 1×1 + (−3)×0 + (−1)×1 + 1×0 = 3
y (so far):
 3
55
2D Convolution (same x and h as above)
Slide the kernel one pixel to the right:
2×0 + 1×1 + 1×0 + 1×1 + 1×1 + 1×1 + (−1)×0 + 1×1 + 0×0 = 5
y (so far):
 3  5
56
2D Convolution (same x and h as above)
Slide the kernel one more pixel to the right:
1×0 + 1×1 + 0×0 + 1×1 + 1×1 + 0×1 + 1×0 + 0×1 + 1×0 = 3
y (so far):
 3  5  3
57
2D Convolution

y[n1, n2] = x[n1, n2] * h[n1, n2] = Σ_{k1=−∞}^{∞} Σ_{k2=−∞}^{∞} x[k1, k2] h[n1 − k1, n2 − k2]

Sliding the 3 × 3 kernel h over every position of the 5 × 5 input x gives the complete output:
y =
 3  5  3
 4  1  3
 9  8  8
The output (3 × 3) is smaller than the input (5 × 5).
58
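A small NumPy sketch (not from the slides) that reproduces the 3 × 3 "valid" output above by sliding the kernel over the input; since this kernel is symmetric, flipping it for a true convolution would give the same result.

import numpy as np

x = np.array([[ 0,  2, 1, 1, 0],
              [ 0,  1, 1, 1, 0],
              [-3, -1, 1, 0, 1],
              [ 0,  6, 0, 0, 1],
              [ 0,  4, 1, 7, 0]])
h = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])

out = np.zeros((3, 3), dtype=int)
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(x[i:i+3, j:j+3] * h)   # multiply the patch elementwise and sum
print(out)   # [[3 5 3] [4 1 3] [9 8 8]]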
2D Convolution: Zero-padding
Pad the input x with a border of zeros before convolving:
 0  0  0  0  0  0  0
 0  0  2  1  1  0  0
 0  0  1  1  1  0  0
 0 −3 −1  1  0  1  0
 0  0  6  0  0  1  0
 0  0  4  1  7  0  0
 0  0  0  0  0  0  0
Filter/kernel h (3 × 3) as before. Without padding, the output was the smaller 3 × 3 result:
 3  5  3
 4  1  3
 9  8  8
59
2D Convolution: Zero-padding
Convolving the zero-padded input with the same 3 × 3 kernel h gives an output of the same size as the original input (5 × 5):
 2  4  5  3  1
−2  3  5  3  2
−4  4  1  3  2
 3  9  8  8  2
 4 11 12  8  8
60
2D Convolution
❖ Low-pass Filter
[Figure: an input image convolved (*) with a 7 × 7 all-ones filter/kernel produces a blurred, low-pass-filtered image.]
61
2D Convolution
❖ "Gradient image"
[Figure: convolving the image (*) with the filter/kernel
 −1  1
 −1  1
produces a gradient image that highlights intensity changes in the horizontal direction.]
62
2D Convolution
❖ "Gradient image"
[Figure: convolving the image (*) with the filter/kernel
 −1 −1
  1  1
produces a gradient image that highlights intensity changes in the vertical direction.]
63
2D Convolution
❖ "Plus-sign Detector"
[Figure: convolving the image (*) with the plus-shaped filter/kernel
 0  1  0
 1  1  1
 0  1  0
responds strongly where the image contains a plus-sign pattern.]
64
2D Convolution
❖ Different filters/kernels process images differently
  ❖ Low-pass filter (blurring/denoising)
  ❖ High-pass filter (sharpening)
  ❖ Shape detector
  ❖ Refer to Image Kernels for a good demo
  ❖ Different filters extract complementary information
❖ Instead of defining the filters ourselves, we let the neural network come up with good ones for us: the filter/kernel entries w11, w12, w13, …, w31, w32, w33 become trainable weights that define the relationship between neighboring pixels.

model.add(Conv2D(32, (3, 3), padding="same", activation="relu"))   # 32 = number of filters, (3, 3) = filter size
65
2D Convolution
❖ 2D convolution with more than one channel
[Figure: an RGB input (R, G, B channels) convolved with 1 filter gives one output channel; convolved with 128 filters it gives 128 output channels.]
A Comprehensive Introduction to Different Types of Convolutions in Deep Learning
66
Pooling Layers
model.add(layers.MaxPooling2D((2,2)))
Mixed Pooling for Convolutional Neural Networks
67
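A tiny NumPy sketch (not from the slides) of the 2 × 2, stride-2 max-pooling that MaxPooling2D((2, 2)) performs on each channel; the input values are arbitrary.

import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 8, 2],
              [3, 2, 1, 4]])

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))   # take the max of each 2x2 block
print(pooled)   # [[4 5] [3 8]]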
keywords: feature maps, activation maps, receptive field, backbone
Convolutional Neural Network (CNN)
[Figure: stacked convolution and pooling stages turn the input image into feature maps at progressively coarser resolutions; the final outputs are the probability of being a cat and the probability of being a dog.]
Same filter size (e.g., a 3 × 3 filter), but different coverages (receptive fields) at different "resolutions".
Disadvantages of CNN models
68
keywords: feature maps, activation maps, receptive field, backbone
Convolutional Neural Network (CNN)
[Figure: the convolution/pooling layers act as a feature-extraction stage that maps each image to a point in a feature space (feature 1 vs. feature 2); the fully connected layers on top perform the classification/regression (probability of being a cat, probability of being a dog).]
Disadvantages of CNN models
69
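A hedged Keras sketch (not from the slides) of a small CNN in the spirit of the figure: a convolution/pooling feature-extraction stage followed by a fully connected classification stage. The input size and filter counts are arbitrary example values.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                  input_shape=(64, 64, 3)),                       # RGB input image
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                             # feature maps -> feature vector
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),                        # P(cat), P(dog)
])
model.summary()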
[Recap of the Deep Learning component map from slide 4: fully connected, convolutional, and pooling layers; activation functions; loss functions; optimizers; regularization; evaluation metrics. Combine basic components to build a neural network; more components → "more" representative power.]
70
ImageNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Figure: top-5 classification error of the winning entry per year; traditional ML (the SVM era) before 2012, the deep learning era from 2012 onward.]
- 2012, AlexNet: Conv, Pooling, Dense, Dropout, ReLU, SGD with momentum
- 2013, ZFNet: optimized "AlexNet" in terms of number of filters, kernel sizes, etc.
- 2014: 19 layers with filter size = 3×3 (VGG); 1×1 Conv introduced to reduce the dimensions, multiple losses to help with gradients, Inception module (GoogLeNet)
- 2015: skip (residual) connections to help with the vanishing gradient problem, batch normalization
- 2016: network ensembles
- 2017: Squeeze-and-Excitation (SE) block, adaptive feature map reweighting
The last annual ImageNet competition was held in 2017.
More networks: Xception, ResNeXt, DenseNet, MobileNet, NASNet, EfficientNet, SqueezeNet, Feature Pyramid Network
Stanford CS231n (Spring 2022): Lecture 6.

Model complexity
[Figure: accuracy vs. floating-point operations (FLOPs) required for a single forward pass, and inference speed on a Titan Xp GPU vs. an NVIDIA Jetson TX1, for representative architectures.]
Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277.
Understanding CNN - Filter Visualization
[Figure: the learned weights of the 64 first-layer (conv1) filters of AlexNet, each of size 11 × 11 with 3 channels, visualized next to an input image.]
What these filters look for (through template matching/inner product):
- oriented edges
- opposing colors
Similar to the human visual system!
Stanford CS231n (Spring 2022): Lecture 8.
74
Understanding CNN - Filter Visualization
[Figure: filter weights from deeper conv layers.]
The weights of conv2 tell us what type of activation patterns after conv1 would cause conv2 to maximally activate. These filters are not connected directly to the input image, so they are not directly interpretable. We need more complicated techniques to get a sense of what is going on.
Stanford CS231n (Spring 2022): Lecture 8.
75
Understanding CNN - Last Layer
For each input image, the output of fc7 is a feature vector (last layer) of size 4096.
[Figure: for each test image, its nearest neighbors in fc7 feature space are semantically similar images, compared with L2 nearest neighbors in pixel space.]
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (2012).
Stanford CS231n (Spring 2022): Lecture 8.
76
Understanding CNN - Last Layer
Take the fc7 output (a feature vector of size 4096) for each input image and apply a machine learning algorithm (Barnes-Hut t-SNE) to it.
[Figure: 2D t-SNE embedding of images based on their fc7 features.]
https://cs.stanford.edu/people/karpathy/cnnembed/
Stanford CS231n (Spring 2022): Lecture 8.
77
Understanding CNN - Other Techniques
Saliency Maps: gradient of the class score w.r.t. the image pixels
Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps." arXiv preprint arXiv:1312.6034 (2013).
Occlusion Experiments
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer, Cham, 2014.
…and many more, such as
❖ Guided Backpropagation
❖ Grad-CAM, Score-CAM, HiResCAM
❖ Attention map visualization
Stanford CS231n (Spring 2022): Lecture 8.
78
Trained CNN as a Feature Extractor
What these filters look for (through template
matching/inner product)
- oriented edges
- opposing colors
These patterns are important for processing other types of images as
well, so we can reuse trained weights when we encounter a related
task or the same task but with a different dataset
Stanford CS231n (Spring 2022): Lecture 8.
79
Trained CNN as a Feature Extractor
Previous task: cat-vs-dog classification. We train all the weights of this network (outputs: probability of being a cat, probability of being a dog).
New task: watermelon-vs-orange classification. Take parts of the trained network and use them as a pretrained feature extractor; only train the weights of the new, to-be-trained classifier on top (output: watermelon or orange?).
80
Transfer Learning and Fine-Tuning
Source task: ImageNet (1000 classes, 14M images). New task: AI-vs-Artists (2 classes); assume a small dataset.
[Figure: early layers (layer 1, layer 2, …) are more "reusable"; later layers (…, layer L−1, layer L, FC-4096, FC-4096, FC-1000) are less "reusable".]
Clone the layers along with their weights and do not modify the weights; replace the classification layer (FC-1000 → FC-2) and train only that layer.
81
Transfer Learning and Fine-Tuning
Source task: ImageNet (1000 classes, 14M images).
New task with a small dataset (AI-vs-Artists, 2 classes): clone the layers along with their weights and do not modify the weights; replace the classification layer (FC-2) and train it.
New task with a dataset of moderate size (dog-vs-cat, 2 classes): clone the early, more "reusable" layers along with their weights and do not modify the weights; clone the later, less "reusable" layers along with their weights and retrain them with an initially low learning rate; replace the final layers (a new fully connected layer and FC-2) and train them.
82
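A hedged Keras sketch (not from the slides) of the small-dataset recipe above: reuse an ImageNet-pretrained backbone with frozen weights as a feature extractor and train only a new 2-class head. MobileNetV2 and the learning rate are arbitrary example choices; for fine-tuning on a moderate-size dataset, one would later set base.trainable = True and continue training with a low learning rate.

import tensorflow as tf
from tensorflow.keras import layers

# Pretrained ImageNet backbone, without its original 1000-class classification layer
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False                      # clone the layers and do not modify their weights

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),  # replace the classification layer (2 classes) and train it
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy())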
Outline
❖ Deep Learning Components
  ❖ Fully connected layer
  ❖ Computation graph
  ❖ Activation functions
  ❖ Loss function
  ❖ Model optimization
    ❖ Gradient descent
    ❖ Backpropagation
    ❖ Stochastic gradient descent (SGD)
  ❖ Regularization
    ❖ ℓ2 and ℓ1 regularization
    ❖ Dropout
    ❖ Batch normalization
  ❖ Convolutional layer
  ❖ Pooling layer
❖ Convolutional Neural Network (CNN)
83