Uploaded by ay47757

Pattern Revision

Lecture 1
(Introduction)
• AI includes any technique that mimics human behavior
• Classical AI (Excluding ML and DL)
– Rule based expert systems
– Knowledge reasoning
– No learning needed (No data is available)
– Simpler applications
• Machine Learning (statistical techniques)
– Programs learn from examples
– Improve performance based on experience
– Finding patterns in data
– Adapt to complex and changing environments
• Deep Learning
– Multi-layered neural networks
– end-to-end solution (Works on raw data)
– Requires big data and high computational power
When to use Deep Learning?
1- Big amount of data (expensive)
2- Availability of high computational power (expensive)
3- Lack of domain understanding
4- Complex problems
Note: in deep learning, performance keeps improving as the amount of data grows, whereas the
performance of older learning techniques levels off.
Limitations of Deep Learning?
1- Lack of adaptability and generality compared to human vision
2- Large amount of labeled data needed (Expensive and sometimes need experts)
3- Datasets may be biased against rare patterns
4- Sensitive to standard adversarial attacks
5- Over-sensitive to changes in context
6- Combinatorial explosion
7- Visual understanding is tricky (Mirrors, Physics, Humor)
Main Types of Learning?
Supervised Learning
• Depends on labeled examples
• Learn the unknown function that maps inputs to outputs
• Error = f(target outputs - actual outputs)
• Modeling p(y|x)
Steps:
1- Raw Data
2- Pre-processing
3- Feature Extraction
4- Training a classifier
Unsupervised Learning
• Learn generative model of the input data that tries to describe possible patterns and
features of these inputs without the need for labeled outputs.
• Types:
– Clustering Problems
Given unlabeled data, group data points of alike features together
– Association Problems
Find dependencies within data, generate dependency rules for better
decisions
• Modeling p(x)
Steps of K-means (Clustering):
1- Select K points as the centers of K clusters
2- Assign each remaining point to the cluster of its nearest center
3- Calculate the mean of each cluster
4- Repeat the clustering based on the means until there is no change
5- Compute the sum of the variations of each cluster (sum of squared differences)
6- Repeat several times starting from step 1
7- Take the cluster centers with the minimum variation as the best option for K clusters
8- Repeat the above algorithm starting with K = 1 and keep increasing K
9- Plot an elbow plot of the reduction in variation per K
10- Select the elbow of the plot as your K
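The K-means steps above can be sketched in a few lines of Python. This is only an illustration on 1-D points: the function names `kmeans` and `best_of_restarts`, the toy data, and the number of restarts are assumptions made for the example, not part of the lecture.

```python
import random

def kmeans(points, centers, iters=100):
    """Steps 2-5: cluster 1-D points around the given initial centers."""
    centers = list(centers)
    k = len(centers)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 2: assign each point to its nearest center
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # step 3: recompute the means (an empty cluster keeps its old center)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:  # step 4: stop when nothing changes
            break
        centers = new_centers
    # step 5: total variation = sum of squared distances to the nearest center
    variation = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, variation

def best_of_restarts(points, k, restarts=10, seed=0):
    """Steps 1, 6, 7: try several random starts and keep the lowest variation."""
    rng = random.Random(seed)
    runs = [kmeans(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
# steps 8-9: variation per K, ready to be drawn as an elbow plot
elbow = {k: best_of_restarts(data, k)[1] for k in range(1, 5)}
```

Plotting `elbow` against K (step 9) and picking the bend (step 10) is left out here, since it is a visual judgment rather than a computation.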
Semi-Supervised Learning (Maximization problem)
• Few labeled data
• Lots of unlabeled data
• Make use of unlabeled data by the help of labeled data
Steps:
1- Build supervised learning model based on labeled data
2- Expectation step: Label unlabeled data using the model built in 1
3- Maximization step: Retrain the model based on all the data
4- Repeat starting from step 2 until convergence
Reinforcement Learning
• Training Info = evaluations(rewards/penalties)
• Objective of the agent is to get as much reward as possible
What is a pattern?
• A pattern is an abstract object, such as a set of measurements describing a physical
object
Why is object recognition a hard problem?
• Due to variability of objects
• Same objects can look different (rotated, sizes, different angle of camera)
Classification VS Regression?
• Both problems try to learn an unknown function that maps inputs to outputs (function
approximation)
• Classification → Predict a label
• Regression → Predict a quantity
After collecting data and choosing the features which can effectively differentiate between the
classes, a density function is estimated.
• It is very difficult to design a classifier that yields no classification errors
• Classification errors occur due to the variability of the patterns typically encountered
• Using multiple features reduces the classification errors
Lecture 2
(Training Patterns)
Feature Vector
$$X(m) = \begin{bmatrix} X_1(m) \\ X_2(m) \\ \vdots \\ X_N(m) \end{bmatrix}$$
N → Number of features
m → Represents the mth training pattern
M → Number of training patterns
Decision Regions
• The data points of each class occur in groupings or clusters in the feature space plot
The equation to represent the linear decision boundary :
W0 + W1 X1 + … + WN XN = 0
• W0 and Wi are the constants that determine the position of the hyperplane
Types of Problems
A problem is said to be linearly separable if there is a hyperplane that can separate the
training data points of class C1 from those of C2
Lecture 3
(Pattern Classification Methods)
Minimum Distance Classifier
Steps:
• Choose a center or representative pattern V(k) for each class k
• Given a pattern X that we would like to classify
• Compute the Euclidean distance from X to each center V(k); in two dimensions, for points
(x1, y1) and (x2, y2):
$$d^2 = (x_2 - x_1)^2 + (y_2 - y_1)^2$$
• Find the index k of the minimum distance; that class is the classification of X
• To calculate the class center from the M_i training patterns of class i:
$$V(i) = \frac{1}{M_i} \sum_{m=1}^{M_i} X(m)$$
Disadvantages:
• Too simple to solve difficult problems
• Outliers badly affect the position of the class mean
• Poor performance if the classes overlap
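As a rough illustration of the minimum distance classifier, the sketch below computes each class center as the mean of its training patterns and assigns a new pattern to the nearest center. The names `class_center` and `min_distance_classify` and the toy 2-D data are made up for this example.

```python
import math

def class_center(patterns):
    """V(i): component-wise mean of the training patterns of one class."""
    n = len(patterns[0])
    return [sum(p[j] for p in patterns) / len(patterns) for j in range(n)]

def min_distance_classify(x, centers):
    """Return the label whose center is closest to x (Euclidean distance)."""
    return min(centers, key=lambda label: math.dist(x, centers[label]))

# Hypothetical training patterns for two classes
training = {
    "C1": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    "C2": [(6.0, 6.0), (7.0, 6.0), (6.0, 7.0)],
}
centers = {label: class_center(pats) for label, pats in training.items()}
label = min_distance_classify((2.0, 2.0), centers)
```

Because only the class means are stored, a single far-away training pattern shifts its class center, which is the outlier weakness listed above.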
Nearest Neighbor Classifier (NN)
The class of the nearest pattern to X determines its classification
Steps:
• Compute the distance between pattern X and each pattern X(m) in the training set
• The class of the pattern m that corresponds to the minimum distance is chosen as the
classification of X
Advantages:
• simplicity
Disadvantages:
• Sensitive to outliers
• Patterns with large overlaps between the class can negatively affect performance
K-Nearest Neighbor Classifier (KNN)
** (if number of classes are odd → K =??)
Same as NN but takes the k nearest points into consideration
Less dependent on strange patterns (outliers) compared to the nearest neighbour classification rule
Disadvantages:
• The neighbours could be a bit far away from X leading to using information that might
not be relevant
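A small sketch of the NN and KNN rules may help show why KNN is less sensitive to strange patterns. The function name `knn_classify` and the toy data (including one deliberately mislabeled-looking outlier) are assumptions for illustration; with k = 1 the function reduces to the plain nearest neighbor rule.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """training is a list of (pattern, label); majority vote among the k nearest."""
    nearest = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "C1"), ((1.0, 2.0), "C1"), ((2.0, 1.0), "C1"),
    ((2.2, 2.2), "C2"),  # an outlier of C2 sitting inside the C1 cluster
    ((6.0, 6.0), "C2"), ((7.0, 6.0), "C2"), ((6.0, 7.0), "C2"),
]
nn_label = knn_classify((2.0, 2.0), training, k=1)   # decided by the outlier
knn_label = knn_classify((2.0, 2.0), training, k=3)  # outlier is outvoted
```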
Lecture 4
(Bayes Classification Rule)
→ It assigns the data point to the most likely class, which makes it an optimal
classification rule
→ However, the Bayes classifier assumes that the probability densities are known, which is not
usually the case.
→ Note: The a priori probabilities represent the frequencies of the classes irrespective of the
observed features (we calculate them from the training data labels)
→ The Bayes classifier is a linear classifier iff the covariance matrices of all classes are the same.
Steps:
Given a pattern X (with unknown class) that we wish to classify
• Compute P(C1|X), P(C2|X), ..., P(CK|X) (we cannot calculate these directly, so we
use Bayes rule)
• Find the k giving the maximum P(Ck|X)
Rule:
$$P(C_i|X) = \frac{P(C_i, X)}{P(X)} = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$
P(Ck|X) → Posterior probability
P(Ck) → A priori probability
P(X|Ck) → Class-conditional density (class-conditional probability function)
Total Probability law (Marginalization):
$$P(X) = \sum_{i} P(X|C_i)\,P(C_i)$$
In reality P(X) doesn't have to be computed as it is a common factor
PCorrect:
(1)
PError:
P(error) = 1 - P(correct)
To get the decision boundary
P(X|C1 )P(C1 ) = P(X|C2 )P(C2 )
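Gaussian Class-conditional densities
1-D
$$P(X|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X - \mu_i)^2}{2\sigma^2}}$$
$$\mu_i = E(X) = \int X \, \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X - \mu_i)^2}{2\sigma^2}} \, dX$$
$$\text{Variance} = E\left[(X - \mu_i)^2\right] = \sigma^2$$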
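Multi-dimensions
Independent Case
$$P(X|C_i) = \frac{1}{(2\pi)^{N/2}\,\sigma_1 \sigma_2 \cdots \sigma_N} \exp\left(-\frac{1}{2} \sum_{j=1}^{N} \frac{(X_j - \mu_j)^2}{\sigma_j^2}\right)$$
All entries of the covariance matrix are zero except those on the main diagonal
Dependent Case
$$P(X|C_i) = \frac{1}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma)} \exp\left(-\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$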
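To make the Bayes rule with Gaussian class-conditional densities concrete, here is a minimal 1-D sketch. The priors, means and standard deviations are invented for the example, and the names `gaussian_pdf`, `posteriors` and `bayes_classify` are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian class-conditional density P(X|Ci)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posteriors(x, classes):
    """classes maps label -> (prior, mu, sigma); returns P(Ci|X) for every class."""
    joint = {c: prior * gaussian_pdf(x, mu, sigma)
             for c, (prior, mu, sigma) in classes.items()}
    p_x = sum(joint.values())  # total probability law
    return {c: value / p_x for c, value in joint.items()}

def bayes_classify(x, classes):
    """Pick the class with the largest P(X|Ci) P(Ci); P(X) is a common factor."""
    return max(classes, key=lambda c: classes[c][0] * gaussian_pdf(x, *classes[c][1:]))

# Assumed parameters: equal priors, equal variances, means at 0 and 4
classes = {"C1": (0.5, 0.0, 1.0), "C2": (0.5, 4.0, 1.0)}
post = posteriors(1.0, classes)
```

With equal priors and equal variances the decision boundary P(X|C1)P(C1) = P(X|C2)P(C2) falls midway between the means, at X = 2 in this toy setup.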
Lecture 5
Density Estimation
Probability densities have to be estimated to apply bayes rule
Histogram Analysis
$$p(x) = \frac{m}{M \cdot \text{sizeOfBin}}$$
m → number of data points falling in the bin
M → total number of points that belong to the same class
• Weak method of estimation
• Discontinuity of these density estimates, even though the true densities are assumed
to be smooth
Naive Estimator
--more accurate than histogram analysis--
$$P(X) = \frac{\#\text{points falling in } \left[X - \frac{h}{2},\; X + \frac{h}{2}\right]}{M h}$$
• Discontinuity of the density estimates
• All data points are weighted equally regardless of their distance to the estimation point
Kernel Density Estimator (Parzen Window)
Steps:
• Choose a bump function
• Apply the bump function to each point
• Sum all bump functions
Choose the bump function as a Gaussian with standard deviation (bandwidth) h:
$$\phi_h(x) = \frac{1}{\sqrt{2\pi}\,h} e^{-\frac{x^2}{2h^2}}$$
$$P(X|C_i) = \frac{1}{M} \sum_{m=1}^{M} \phi_h(X - X(m))$$
• 𝜙h does not have to be Gaussian:
$$\phi_h(x) = \frac{1}{h}\, g\!\left(\frac{x}{h}\right)$$
• g(.) should integrate to 1 and be a known distribution
• Check Parzen window estimator as a bump function
• Note:
– Naïve estimator is equivalent to a Parzen window estimator with:
$$g(x) = \begin{cases} 1 & -0.5 < x < 0.5 \\ 0 & \text{otherwise} \end{cases}$$
• Multi-dimension Form:
How to choose h
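$$H_i = \sigma_i \left[\frac{4}{(N + 2)M}\right]^{\frac{1}{N+4}}$$
$$\sigma_i = \sqrt{[\Sigma_x]_{i,i}}$$
$$\Sigma_x = \frac{1}{M} \sum_{m=1}^{M} (X(m) - \mu)(X(m) - \mu)^T$$
$$h_{opt} = \frac{1}{N} \sum_{i=1}^{N} H_i$$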
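The density estimators of this lecture can be tried out with a short 1-D sketch. It assumes N = 1 (so the bandwidth rule reduces to h = σ (4 / (3M))^(1/5)); the function names and the sample values are invented for illustration.

```python
import math

def gaussian_bump(x, h):
    """Gaussian bump with bandwidth h."""
    return math.exp(-x * x / (2 * h * h)) / (math.sqrt(2 * math.pi) * h)

def box_bump(x, h):
    """(1/h) g(x/h) with g = 1 on (-0.5, 0.5); reproduces the naive estimator."""
    return 1.0 / h if -0.5 < x / h < 0.5 else 0.0

def parzen_density(x, samples, h, bump=gaussian_bump):
    """Average of the bumps centered on every training point."""
    return sum(bump(x - xm, h) for xm in samples) / len(samples)

def bandwidth_1d(samples):
    """1-D case (N = 1) of H = sigma * (4 / ((N + 2) M)) ** (1 / (N + 4))."""
    M = len(samples)
    mu = sum(samples) / M
    sigma = math.sqrt(sum((s - mu) ** 2 for s in samples) / M)
    return sigma * (4.0 / (3.0 * M)) ** 0.2

samples = [1.0, 1.5, 2.0, 4.0]
h = bandwidth_1d(samples)
smooth_estimate = parzen_density(2.0, samples, h)            # Gaussian bumps
naive_estimate = parzen_density(1.5, samples, 1.2, box_bump)  # 3 points in the window
```

Swapping `box_bump` for `gaussian_bump` is the whole difference between the naive estimator and the smooth Parzen estimate: the box weights all points in the window equally, the Gaussian weights them by distance.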
Lecture 6
Feature Selection & Extraction
• If we use too many features we will suffer from curse of dimensionality
• Generalization ability: Is the ability of the classifier to generalize well for data it had
not seen before
Curse Of Dimensionality:
• The parameters’ estimates (mean vector and covariance matrix) will not be very
accurate
• The data points will become scattered and clustering of classes will not be clear
Types of Features to be Removed
- Irrelevant Features
- Correlated Features:
– Features that vary very closely with other features (Ex: #corners & #lines)
Feature Selection
• From the available N features choose a subset of L (≪N)
Sequential Forward Selection Algorithm (SFS):
1- Examine each feature Xi taken alone, design the classifier and select the best feature
2- From the remaining features, select the one that, together with the already selected group,
gives the best performance
3- Continue in this manner until you have selected L features
Sequential Backward Selection Algorithm (SBS):
1- Start with all features selected. Remove one feature at a time, design the classifier
and compute Pc
2- The feature we end up removing is the one that results in the least reduction of Pc
3- Keep removing until only L selected features remain
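The greedy structure of SFS can be sketched as follows. The `score` callback stands in for "design the classifier and measure its performance" and is purely hypothetical, as are the feature names and the toy scoring numbers.

```python
def sequential_forward_selection(features, score, L):
    """Greedy SFS: repeatedly add the feature that gives the best score."""
    selected = []
    remaining = list(features)
    while len(selected) < L and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up usefulness values; a real score would train and evaluate a classifier.
usefulness = {"corners": 0.6, "lines": 0.58, "area": 0.3, "noise": 0.0}

def toy_score(subset):
    total = sum(usefulness[f] for f in subset)
    if "corners" in subset and "lines" in subset:
        total -= 0.5  # correlated features add little new information
    return total

chosen = sequential_forward_selection(usefulness, toy_score, L=2)
```

In this toy setup "lines" is the second-best feature on its own, yet SFS skips it in favor of "area" because it is correlated with the already selected "corners". SBS would follow the same pattern in reverse, starting from all features and removing one at a time.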
Methods of Feature Selection:
• Filter type: select features without looking into the classifier you are going to use.
– Advantages → Faster Execution & Generality
– Disadvantages → Tendency to select large subsets
• Wrapper Type: select features by taking into consideration the classifier you will
use (SFS & SBS)
– Advantages → Accuracy
– Disadvantages → Slow Execution & Lack of Generality
Feature Extraction
• Transform the available N features into a smaller no. of L features through certain
transformations (usually linear)
• Transformed features may have no physical meaning → explaining the model is
problematic
• May not be suitable to every domain
Principal Component Analysis (PCA):
• In PCA, the most interesting directions are those having large variations (large
variance)
Steps:
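- Transform data to zero mean (Y = X - 𝜇)
- Estimate the covariance matrix from the data
$$\Sigma = \frac{1}{M} \sum_{m=1}^{M} Y(m)\,Y(m)^T$$
- Compute the eigenvalues 𝜆 and eigenvectors u (principal components) of Σ
$$(\Sigma - \lambda I)\,u = 0$$
$$|\Sigma - \lambda I| = 0$$
- Choose the eigenvectors corresponding to the largest eigenvalues
- Transform the data to the new space
$$Z = \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_L^T \end{bmatrix} Y = A\,Y$$
$$\text{Cov}(Z) = A\,\Sigma\,A^T = U^T \Sigma\, U = \Omega$$
where U = [u1 ... uL] holds the chosen eigenvectors as columns (so A = U^T) and Ω is the
diagonal matrix of their eigenvalues, since
$$\Sigma U = U \Omega \;\Rightarrow\; U^T \Sigma\, U = \Omega$$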
To choose L (the new number of features):
• Sum the largest 4-5 eigenvalues, then divide by the sum of all eigenvalues
• This ratio gives the accuracy (the fraction of the total variance that is retained).
• Keep adding the next largest eigenvalue to the numerator sum until reaching the desired
accuracy (e.g. 99%).
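The rule for choosing L can be written as a short loop over the sorted eigenvalues. The sketch below starts from a single eigenvalue rather than 4-5, which is a simplification; the function name and the eigenvalues are made up for the example.

```python
def choose_num_components(eigenvalues, target=0.99):
    """Smallest L whose largest-L eigenvalues reach the target variance ratio."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for L, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return L, running / total
    return len(ordered), 1.0

# Hypothetical eigenvalues of a 5-feature covariance matrix
eigenvalues = [6.0, 2.5, 1.0, 0.3, 0.2]
L, ratio = choose_num_components(eigenvalues, target=0.9)
```

Here the cumulative ratios are 0.60, 0.85, 0.95, 0.98 and 1.00, so three components are enough for a 90% target, while a 99% target would need all five.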
Lecture 7
Classifiers Combinations
Why classifiers combination?
Best classifier for the training set might not be the best for the test set, so it is risky to choose
the best in training set. As a result, it is better to combine classifiers with high diversity.
Ways to combine
• Majority Vote
• Average class posterior probabilities
• Average some sort of score function
• Can also choose median instead of average
Adaboost
• Iterative procedure
• Tries to approximate the Bayes classifier by combining many of the best weak classifiers to
create a strong classifier
• AdaBoost works well only in case of binary classification
• AdaBoost assumes that the error of each weak classifier is less than 0.5
– 𝛼t is negative if the error is greater than 0.5 (the log of a number less than 1 is negative)
– The error of random guessing in case of two classes is 0.5
– In case of K classes the random guessing error rate is (K - 1)/K
Multi Class AdaBoost
$$\alpha_t = \log\left(\frac{1 - err}{err}\right) + \log(K - 1)$$
∴ 𝛼t is positive only if
$$(1 - err) > \frac{1}{K}$$
Weak Classifier: is able to guess the right class with a percentage slightly bigger than
random guessing
Strong Classifier: is usually correct > 80%
Steps:
1- Initialize the weights of the training examples:
$$w_m = \frac{1}{M}$$
2- For t = 1 to T:
• Select a classifier ht that best fits the training data using the weights wm of the training
examples
• Compute the error of ht as:
$$err_t = \frac{\sum_{m=1}^{M} w_m\, I(c_m \neq h_t(x_m))}{\sum_{m=1}^{M} w_m}$$
• Compute the weight of the classifier:
$$\alpha_t = \log\left(\frac{1 - err_t}{err_t}\right)$$
• Update the weights of the wrongly classified examples:
$$w_m = w_m\, e^{\alpha_t I(c_m \neq h_t(x_m))}$$
• Renormalize the weights wm
3- Output:
$$C(x) = \arg\max_k \sum_{t=1}^{T} \alpha_t\, I(h_t(x) = k)$$
where I(.) is 1 when its argument is true and 0 otherwise
Overfitting:
• AdaBoost is robust to overfitting provided that we select the best weak classifiers
• However, relying on complex classifiers will be more prone to overfitting
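The AdaBoost steps can be followed on a tiny binary problem. In this sketch the weak classifiers are assumed to be 1-D threshold "stumps" (the lecture does not prescribe a particular weak classifier), and the data set is a toy example that no single stump can fit.

```python
import math

def stump_predict(thr, left, x):
    """Weak classifier: class `left` for x <= thr, the other class otherwise."""
    return left if x <= thr else 1 - left

def train_stump(xs, cs, w):
    """Pick the threshold/orientation with the smallest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for left in (0, 1):
            err = sum(wi for wi, x, c in zip(w, xs, cs)
                      if stump_predict(thr, left, x) != c) / sum(w)
            if best is None or err < best[0]:
                best = (err, thr, left)
    return best

def adaboost(xs, cs, T):
    M = len(xs)
    w = [1.0 / M] * M                                  # step 1
    ensemble = []
    for _ in range(T):                                 # step 2
        err, thr, left = train_stump(xs, cs, w)
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard the logarithm
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, thr, left))
        w = [wi * math.exp(alpha) if stump_predict(thr, left, x) != c else wi
             for wi, x, c in zip(w, xs, cs)]
        total = sum(w)
        w = [wi / total for wi in w]                   # renormalize
    return ensemble

def predict(ensemble, x):
    """Step 3: weighted vote of the weak classifiers."""
    votes = {0: 0.0, 1: 0.0}
    for alpha, thr, left in ensemble:
        votes[stump_predict(thr, left, x)] += alpha
    return max(votes, key=votes.get)

xs = list(range(10))
cs = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
single = adaboost(xs, cs, T=1)
boosted = adaboost(xs, cs, T=3)
```

On this data the best single stump misclassifies three of the ten examples, while the weighted vote of three stumps classifies all of them correctly.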
Lecture 8
GMM
Why GMM?
• Assume we have a small data set, so it is not possible to estimate the class conditionals using
the kernel density estimator. Instead, we model each class conditional as a weighted sum of
multivariate Gaussian densities.
$$P(X) = \sum_{j=1}^{K} w_j \frac{1}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma_j)} e^{-\frac{1}{2}(X - \mu_j)^T \Sigma_j^{-1} (X - \mu_j)} = \sum_{j=1}^{K} w_j\, \mathcal{N}(X; \mu_j, \Sigma_j)$$
$$\sum_{j=1}^{K} w_j = 1$$
j
Issues with GMM
1. Initialization
• Expectation–Maximization (EM) is an iterative algorithm which is very sensitive to initial
conditions. (Start from trash → end up with trash)
• Usually we use K-Means to get a good initialization.
2. Determining Number of Gaussian Components
• 1 < k < M // K = number of gaussians
• Use information-theoretic criteria to obtain the optimal K
• Methods:
o MDL (Minimum Description Length)
o AIC (Akaike’s Information Criterion)
o BIC (Bayesian Information Criterion)
o MML (Minimum Message Length)
Lecture 9
Decision Trees & Random Forests
Building a Decision Tree
split(node, {training examples of that node}):
1- X <- Get best attribute to split examples
2- For each value of X create a child node
3- Split examples to each child node
if(subset of examples is pure):
stop()
else:
split(childNode, {subset of examples})
Which attribute to choose for splitting?
You should get heavily biased subsets (decrease the uncertainty), and should take the purity
of the split into consideration
Entropy
H(S) = - pyes lg(pyes ) - pno lg(pno )
H(s) → Entropy of example subset s
pyes → % of yes examples within subset s
pno → % of no examples within subset s
Information Gain
$$Gain(S, X) = H(S) - \sum_{v \in values(X)} \frac{|S_v|}{|S|} H(S_v)$$
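Entropy and information gain are easy to check numerically. The sketch below uses a made-up four-example data set; the attribute names and the helper names `entropy` and `information_gain` are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i lg(p_i) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, X) = H(S) - sum over values v of |Sv|/|S| * H(Sv)."""
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

examples = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
gain_outlook = information_gain(examples, "outlook", "play")
gain_windy = information_gain(examples, "windy", "play")
```

Splitting on "outlook" yields two pure subsets and removes all uncertainty, while "windy" leaves both subsets as mixed as the original set, so the tree would split on "outlook".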
Overfitting
If we split the decision tree until all training examples are correctly classified, all leaf nodes
will be pure, even if they have just one example (singletons), this will cause overfitting and
the model can't generalize on new data
To avoid overfitting:
Way 1:
Stop splitting when not statistically significant
Way 2 (Post Prune) Better way:
1- For each node (ignoring leaf nodes):
i. Consider removing the node and all its children
ii. Measure performance on validation set
2- Remove node that leads to best improvement
3- Repeat until further removals are harmful
NB: Greedy approach but not optimal
Gain Ratio
$$SplitEntropy(S, X) = -\sum_{v \in values(X)} \frac{|S_v|}{|S|} \lg \frac{|S_v|}{|S|}$$
$$GainRatio(S, X) = \frac{Gain(S, X)}{SplitEntropy(S, X)}$$
Continuous Attributes
• Continuous attributes can be repeated, unlike discrete attributes
• Real values of attributes are sorted and average of each two adjacent examples is a
threshold to be considered
Entropy in multi-class classification
$$H(S) = -\sum_{i} p_i \lg(p_i)$$
Regression:
• Predicted output → avg. of training examples in the subset (or linear regression at
leaves)
• Minimize variance in subsets (instead of maximize gain)
Pros & Cons
Pros
• Interpretable (Not a black box)
• Easily handles irrelevant attributes (Gain = 0)
• Can handle missing data
• Very compact
• Very fast at testing time (O(depth))
Cons
• Greedy so may not find best tree
• Only axis-aligned splits of the data (Continuous data)
Uniform Sample with Replacement
Uniform sampling means that the selection probability of any example in S is 1/|S|
Replacement means that a selected item can be reselected multiple times for the same
subset
Random Forest
Steps:
• M is the number of training examples
• Uniformly sample T subsets(each of size M) with replacement
• Build T decision trees with zero training error
• Take the average/votes of T trees
Building Tree:
• D is the number of features
• Sample K features randomly (K < D)
• Only split on these K random features
• New K features are sampled for every single split
Notes:
- This means that different trees are built so different mistakes by each tree
- No need to tune hyperparameters
- No need to pre-process or scale inputs
- Increase T as much as you can afford (Parallel Processing)
- No need for training/validation split
- Can estimate test error directly from training set
- Second best approach
- Not suitable for raw images
- Improvement: Prune the last split of trees to decrease the size of trees and decrease noise
- Compute error for each training example (Consider only trees that do not include that
example in their training subset 60%)
$$K = \lceil \sqrt{D} \rceil$$
Error Calculations:
$$E_{OOB} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{Z_i} \sum_{j:\,(x_i, y_i) \notin S_j} loss(h_j(x_i), y_i)$$
$$Z_i = \sum_{j=1}^{T} I\left((x_i, y_i) \notin S_j\right)$$
Neural Networks
• Weights are the parameters that encode the information in the brain
• Brain is superior because of the massive parallelism
• Weights act like the “storage” in computers
• When a human encounters a new experience, the weights of the brain are adjusted
• Memories are encoded in the weights
• The weights determine the functionality of the model
• The neuron can implement a linear classifier
Augmented Vectors:
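$$u(m) = \begin{bmatrix} 1 \\ X(m) \end{bmatrix} = \begin{bmatrix} 1 \\ X_1(m) \\ \vdots \\ X_N(m) \end{bmatrix}, \qquad W = \begin{bmatrix} W_0 \\ w \end{bmatrix}, \qquad w = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_N \end{bmatrix}$$
$$y(m) = f\left(W^T u(m)\right)$$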
• It can be used for non linearly separable problems as long as it produces a low
classification error rate
Neural Networks Steps (Linear Perceptron Algorithm):
1- Initialize the weights and threshold (bias) randomly
2- Present the augmented input (or feature) vector u(m) of the mth training pattern and its
corresponding desired output d(m)
3- Calculate the actual output for pattern m → $y(m) = f(W^T u(m))$
4- Adapt the weights according to the following rule (called the Widrow-Hoff rule):
W(new) = W(old) + 𝜂[d(m) - y(m)]u(m)
𝜂 → Learning rate
5- Go to step 2 until all patterns are classified correctly, i.e., d(m) = y(m) for m = 1, … , M
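The perceptron steps can be traced on the logical AND problem, which is linearly separable. This sketch assumes a step activation with outputs 0/1 and starts from zero weights instead of random ones so that the run is reproducible; both choices are assumptions for the example.

```python
def step(a):
    """Assumed activation f: 1 if the weighted sum is non-negative, else 0."""
    return 1 if a >= 0 else 0

def train_perceptron(patterns, targets, eta=1.0, max_epochs=100):
    W = [0.0] * (len(patterns[0]) + 1)       # W0 (bias) followed by W1..WN
    for _ in range(max_epochs):
        errors = 0
        for x, d in zip(patterns, targets):
            u = [1.0] + list(x)               # augmented vector u(m)
            y = step(sum(wi * ui for wi, ui in zip(W, u)))
            if y != d:
                errors += 1
                # W(new) = W(old) + eta * (d - y) * u
                W = [wi + eta * (d - y) * ui for wi, ui in zip(W, u)]
        if errors == 0:                       # all patterns classified correctly
            break
    return W

def classify(W, x):
    return step(W[0] + sum(wi * xi for wi, xi in zip(W[1:], x)))

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]                        # logical AND
W = train_perceptron(patterns, targets)
```

When y(m) = 0 but d(m) = 1 the update adds 𝜂 u(m) to W, which raises the weighted sum for that pattern; when y(m) = 1 but d(m) = 0 it subtracts 𝜂 u(m). That is why the sign in front of 𝜂 has to be positive.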
Least Square Classifier
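• If the problem is not linearly separable, the perceptron algorithm will not converge and will
keep cycling forever (Rosenblatt's theorem only guarantees convergence for linearly separable
problems), which is why we need the least squares classifier
• Define an error function
$$E = \sum_{m=1}^{M} \left(W^T u(m) - b_m\right)^2$$
It measures how close the obtained solution is to the desired one (b_m is the target value for pattern m).
• We then seek to minimize the error function by finding the W that minimizes E
• Define the gradient vector
$$\frac{\partial E}{\partial W} = \begin{bmatrix} \frac{\partial E}{\partial W_0} \\ \frac{\partial E}{\partial W_1} \\ \vdots \\ \frac{\partial E}{\partial W_N} \end{bmatrix}$$
• Set
$$\frac{\partial E}{\partial W} = 0$$
and solve for W
– Advantages:
* Can converge even if the problem is not linearly separable
– Disadvantages:
* Linear classifiers don't solve all problems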
Multi-Layer Network
• The powerful feature of multilayer networks (aka. feed forward networks) is its ability to
learn
• We usually use hidden node fn’s (Activation) that are continuous
Gradient Descent
• We use the concept of steepest descent
• To update the weights
$$W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$$
𝛼 → The learning rate
Back Propagation Algorithm
• It is an algorithm based on the steepest descent concept
• Used to train multilayered networks
Steps:
1- Initialize the weights and threshold (bias) randomly in [-1, 1]
2- Present the augmented input (or feature) vector u(m) of the mth training pattern and its
corresponding desired output d(m)
3- For m = 1 to M:
• Present u(m) to the network and compute the hidden layer outputs and final layer
outputs
• Use these outputs in a backward scheme to compute the partial derivatives of the error
function w.r.t. the weights of each layer
• Update the weights →
$$W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - \alpha \frac{\partial E_m}{\partial W_{i,j}^{[L]}}$$
4- Compute the total error and stop in case of convergence
Disadvantages of Back Propagation
• Too small 𝛼 → Very small steps and reach the min slowly
• Too large 𝛼 → Leads to oscillations and possibly not converge at all
• Use variable 𝛼 (Start with large then decrease it) → Good range is between 0.001 and
0.05
• Prone to get stuck in local minima
– To avoid this problem, repeat the training several times, each time with different
set of initial weights
Types of Weight updates → Check answers below
Other Optimization Algorithms
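• Gradient Descent with momentum
– Smooths out the steps of gradient descent using a moving average of the
derivatives, so it avoids oscillations
$$D_W = \beta D_W + (1 - \beta)\frac{\partial E}{\partial W}, \qquad \hat{D}_W = \frac{D_W}{1 - \beta^t}$$
$$W = W - \alpha \hat{D}_W$$
DW = 0 initially; the division by 1 - 𝛽^t is the bias correction at iteration t
• RMSProp
$$S_W = \beta S_W + (1 - \beta)\left(\frac{\partial E}{\partial W}\right)^2, \qquad \hat{S}_W = \frac{S_W}{1 - \beta^t}$$
$$W = W - \alpha \frac{\partial E / \partial W}{\sqrt{\hat{S}_W} + \epsilon}$$
• Adam
$$S_W = \beta S_W + (1 - \beta)\left(\frac{\partial E}{\partial W}\right)^2, \qquad \hat{S}_W = \frac{S_W}{1 - \beta^t}$$
$$D_W = \beta D_W + (1 - \beta)\frac{\partial E}{\partial W}, \qquad \hat{D}_W = \frac{D_W}{1 - \beta^t}$$
$$W = W - \alpha \frac{\hat{D}_W}{\sqrt{\hat{S}_W} + \epsilon}$$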
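The three update rules can be compared on a single scalar weight. The sketch below reads the 1 - 𝛽 denominators of the notes as the usual bias correction 1 - 𝛽^t and puts a square root on the running average of squared gradients; both are assumptions about what the notes intend, and the function names, default hyperparameters and the toy loss E = (w - 3)² are invented for the example.

```python
import math

def moving_average(avg, value, beta):
    return beta * avg + (1 - beta) * value

def momentum_step(w, grad, d, t, alpha=0.1, beta=0.9):
    d = moving_average(d, grad, beta)
    d_hat = d / (1 - beta ** t)              # bias correction (assumed)
    return w - alpha * d_hat, d

def rmsprop_step(w, grad, s, t, alpha=0.1, beta=0.9, eps=1e-8):
    s = moving_average(s, grad * grad, beta)
    s_hat = s / (1 - beta ** t)              # bias correction (assumed)
    return w - alpha * grad / (math.sqrt(s_hat) + eps), s

def adam_step(w, grad, d, s, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    d = moving_average(d, grad, beta1)       # momentum part
    s = moving_average(s, grad * grad, beta2)  # RMSProp part
    d_hat = d / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * d_hat / (math.sqrt(s_hat) + eps), d, s

def grad_E(w):
    """Gradient of the toy loss E = (w - 3) ** 2."""
    return 2 * (w - 3)

w, d = 0.0, 0.0
for t in range(1, 301):                      # momentum run on the toy loss
    w, d = momentum_step(w, grad_E(w), d, t)
```

On the first step RMSProp and Adam both move by roughly 𝛼 regardless of the gradient's size, because the gradient is divided by its own magnitude; momentum instead moves in proportion to the gradient.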
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization →
$$\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_2^2$$
• L1 Regularization →
$$\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_1$$
• L 2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
Dropout Regularization
• Every epoch → shut down a random number of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
Input & Output Normalization
• Inputs have to be approximately in the range of 0 to 1 or -1 to 1
Machine Learning Recipe
                  Underfitting    Normal         Overfitting
Training Set      Large Error     Small Error    Small Error
Validation Set    Large Error     Small Error    Large Error
Hyper-parameters Tuning
• 1st
– Learning Rate
• 2nd
– Momentum Parameter → 𝛽 = 0.9
– Number of hidden nodes
– Mini batch size
• 3rd
– Number of hidden layers
– Learning rate decay
• Adam Parameters
Tuning Process
• Try random values: don’t use a grid
• Coarse to fine scheme (Focus on the good regions)
• Use appropriate scale
– Don't sample uniformly
– Use logarithmic scale
Machine Learning Model Selection
1- Categorize the problem
• Input → Supervised? Unsupervised?
• Output → Numerical? Regression?
2- Understand your data
• Analyze the data
• Process the data
• Feature Engineering
3- Determine the possible algorithms
4- Implement the machine learning algorithms
5- Tune hyperparameters
Time Series Prediction
• Remove the seasonal periodicities
TS → Deseasonalize → Predict → Return back seasonal components
$$a(year) = \frac{1}{12} \sum_{t \in window} x(t)$$
$$Z(i) = \frac{x(i)}{a(year)}$$
$$u(i) = \frac{\sum_{j} Z_j(i)}{\#years}$$
$$x_{deseasonal}(t) = \frac{x(t)}{u(month(t))}$$
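The deseasonalization formulas can be sketched as below. To keep the example short it uses a "year" of 4 periods instead of 12, and the series is synthetic: a fixed seasonal pattern multiplied by a yearly level. The function name and data are assumptions for illustration.

```python
def deseasonalize(x, period=12):
    """Return (deseasonalized series, seasonal indices u)."""
    years = len(x) // period
    # a(year): average of the series over each year's window
    a = [sum(x[y * period:(y + 1) * period]) / period for y in range(years)]
    # Z_j(i): value of period i in year j relative to that year's average
    z = [[x[y * period + i] / a[y] for i in range(period)] for y in range(years)]
    # u(i): seasonal index of period i, averaged over the years
    u = [sum(z[y][i] for y in range(years)) / years for i in range(period)]
    deseasonal = [x[t] / u[t % period] for t in range(years * period)]
    return deseasonal, u

# Two "years" of 4 periods: pattern [0.5, 1.0, 1.5, 1.0] at levels 10 and 20
series = [5.0, 10.0, 15.0, 10.0, 10.0, 20.0, 30.0, 20.0]
flat, seasonal_index = deseasonalize(series, period=4)
```

After prediction on the flat series, the seasonal component is restored by multiplying each forecast by the index `u` of its period.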
What would have changed in the previous question if the neural network was solving a
regression problem rather than a classification problem?
The difference will be in the loss function of the model.
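In case of classification problem → (Cross Entropy)
$$L = -\left[y \log a^{[2]} + (1 - y) \log\left(1 - a^{[2]}\right)\right]$$
In case of regression problem →
$$L = \frac{1}{M} \sum_{i=1}^{M} \left(a_i^{[2]} - y_i\right)^2 \quad \text{(MSE)}$$
$$L = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left(y_i - \hat{y}_i\right)^2} \quad \text{(RMSE)}$$
One more modified equation will be
$$\frac{\partial L}{\partial a^{[2]}} = \frac{2}{M} \sum_{i=1}^{M} \left(a_i^{[2]} - y_i\right)$$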
What would have changed in the previous question if the neural network was solving a
multi-class classification problem rather than a binary classification problem? Write
down the modified equations.
• The output layer will consist of K nodes where K is number of classes
• We need to change the sigmoid activation function in the final output layer to the softmax
activation function, as it will produce K probabilities that sum up to 1
• The derivative used in backpropagation has to change to the derivative of softmax.
$$y_i = \frac{e^{z_i^{[L]}}}{\sum_{j} e^{z_j^{[L]}}}$$
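A minimal softmax sketch shows that the K outputs behave like probabilities. Subtracting the largest input before exponentiating is a common numerical safeguard that does not change the result; it is an addition for the example, not something the notes state.

```python
import math

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed in a numerically safe way."""
    m = max(z)                       # shifting by the max avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]             # hypothetical final-layer values z for K = 3
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```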
What does the learning rate (alpha parameter) mean? How does changing the learning
rate affect the training process?
• It controls how much of the apportioned error the weights of the model are updated
with each time they are updated
• Too small alpha → Convergence will be slow
• Too large alpha → Training will oscillate around the minimum and may not converge
• Good range is between 0.001 and 0.05
$$W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$$
What is the difference between batch gradient descent and sequential (stochastic)
gradient descent?
Sequential (Stochastic) Gradient Descent (SGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: M (M: # examples):
        1. Forward Propagation
        2. Compute the loss function.
        3. Backward Propagation.
        4. Update the weights and biases.
Advantages:
• Faster updates compared to batch gradient descent
Disadvantages:
• Hard to converge since each update depends on a single example
• Loses the speedup from vectorization
Batch Gradient Descent (BGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: M (M: # examples):
        1. Forward Propagation
        2. Accumulate to the loss
    3. Backward Propagation. (with averaging over M)
    4. Update the weights and biases.
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (too long per iteration)
Mini-Batch Gradient Descent (MBGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: minibatches
        for k = 1: M (M: # examples in the jth minibatch):
            1. Forward Propagation
            2. Accumulate to the loss
        3. Backward Propagation. (with averaging over M)
        4. Update the weights and biases.
Advantages:
• Fast
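The three variants differ only in how many examples contribute to each weight update, which a single function with a `batch_size` parameter can show. The sketch fits a straight line y = w x + b with squared error instead of a neural network; the model, data, and hyperparameters are assumptions chosen to keep the example small.

```python
def gradient_descent(xs, ys, batch_size, alpha=0.05, epochs=1000):
    """batch_size=1 gives SGD, batch_size=len(xs) gives batch gradient descent."""
    w, b = 0.0, 0.0                                   # zeros for reproducibility
    for _ in range(epochs):                           # training loop (epochs)
        for start in range(0, len(xs), batch_size):   # one update per (mini)batch
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # forward pass and gradient of the squared error, averaged over the batch
            errors = [w * x + b - y for x, y in zip(bx, by)]
            grad_w = sum(2 * e * x for e, x in zip(errors, bx)) / len(bx)
            grad_b = sum(2 * e for e in errors) / len(bx)
            w -= alpha * grad_w                       # update weights and bias
            b -= alpha * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                             # generated by y = 2x + 1
sgd = gradient_descent(xs, ys, batch_size=1)
mini = gradient_descent(xs, ys, batch_size=2)
batch = gradient_descent(xs, ys, batch_size=len(xs))
```

Per epoch, SGD performs four updates here, mini-batch two, and batch only one, which is the trade-off between update frequency and the stability of each step described above. Real training data would normally also be shuffled between epochs, which this sketch omits.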
Mention four different types of activation functions. Write down the mathematical
expression for each of them as well as their derivatives. Mention the advantage(s) and
disadvantage(s) of each of them.
Sigmoid →
$$f(x) = \frac{1}{1 + e^{-x}}$$
- Slow learning
- Used only in the output layer in binary classification problems
- Unfortunately its gradient is almost 0 in some parts (for large |x|)
Tanh →
$$f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
- Comes next after ReLU in popularity
- More famous with sequence problems (Speech Recognition)
- Unfortunately its gradient is almost 0 in some parts (for large |x|)
ReLU → f(x) = max(x, 0)
Leaky ReLU → f(x) = x for x > 0 and 𝛼x otherwise
- Most famous now
- Faster learning
- Leaky ReLU is better than ReLU because it has no 0 gradient in the negative part
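Since the question also asks for the derivatives, the sketch below lists each activation next to its derivative. The leak slope 0.01 is a commonly used default and is an assumption here, and the derivative of ReLU at exactly 0 is taken as 0 by convention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # f'(x) = f(x) (1 - f(x))

def tanh(x):
    return math.tanh(x)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2     # f'(x) = 1 - f(x)^2

def relu(x):
    return max(x, 0.0)

def d_relu(x):
    return 1.0 if x > 0 else 0.0       # zero gradient for negative inputs

def leaky_relu(x, a=0.01):
    return x if x > 0 else a * x

def d_leaky_relu(x, a=0.01):
    return 1.0 if x > 0 else a         # small but non-zero gradient for x <= 0

saturated = d_sigmoid(10.0)            # nearly zero: the vanishing-gradient region
```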
What are the hyperparameters in the gradient descent update algorithm? How to
select these hyperparameters?
• Learning rate 𝛼
• Size of the mini batch
• Number of hidden nodes
• Number of layers
Tuning Process:
1- Try random values
2- Coarse to fine scheme
3- Use appropriate scale (Don't sample uniformly, Use logarithmic scale)
What is the difference between training set, test set and validation set? Are there any
guidelines in selecting each of them?
Training Set: Here, you have the complete training dataset. You can extract features and
train to fit a model and so on.
Validation Set: This is crucial for choosing the right parameters for your estimator. We can
divide the training set into a train set and a validation set. Based on the validation results,
the model can be adjusted (for instance, changing parameters or classifiers). This will help us get
the most optimized model.
Testing Set: Here, once the model is obtained, you can predict using the model obtained on
the training set.
Best Practice for selecting the sets:
• Training → 60%
• Validation → 20%
• Testing → 20%
In case of big data:
• Training → 98%
• Validation → 1%
• Testing → 1%
Mention the difference between overfitting and underfitting? Give an example to each
of them.
                  Underfitting    Normal         Overfitting
Training Set      Large Error     Small Error    Small Error
Validation Set    Large Error     Small Error    Large Error
What are different optimization algorithms? State the weight update equation for each
of these optimizers.
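• Gradient Descent without momentum
$$W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$$
• Gradient Descent with momentum
– Smooths out the steps of gradient descent using a moving average of the
derivatives, so it avoids oscillations
$$D_W = \beta D_W + (1 - \beta)\frac{\partial E}{\partial W}, \qquad \hat{D}_W = \frac{D_W}{1 - \beta^t}$$
$$W = W - \alpha \hat{D}_W$$
DW = 0 initially; the division by 1 - 𝛽^t is the bias correction at iteration t
• RMSProp
- Slows down learning in unintended directions
- Avoids oscillations
$$S_{W_i} = \beta S_{W_i} + (1 - \beta)\left(\frac{\partial E}{\partial W_i}\right)^2, \qquad \hat{S}_{W_i} = \frac{S_{W_i}}{1 - \beta^t}$$
$$S_{W_j} = \beta S_{W_j} + (1 - \beta)\left(\frac{\partial E}{\partial W_j}\right)^2, \qquad \hat{S}_{W_j} = \frac{S_{W_j}}{1 - \beta^t}$$
$$W_i = W_i - \alpha \frac{\partial E / \partial W_i}{\sqrt{\hat{S}_{W_i}} + \epsilon}$$
$$W_j = W_j - \alpha \frac{\partial E / \partial W_j}{\sqrt{\hat{S}_{W_j}} + \epsilon}$$
• Adam (combines both RMSProp & momentum)
$$S_{W_i} = \beta S_{W_i} + (1 - \beta)\left(\frac{\partial E}{\partial W_i}\right)^2, \qquad \hat{S}_{W_i} = \frac{S_{W_i}}{1 - \beta^t}$$
$$D_{W_i} = \beta D_{W_i} + (1 - \beta)\frac{\partial E}{\partial W_i}, \qquad \hat{D}_{W_i} = \frac{D_{W_i}}{1 - \beta^t}$$
$$W_i = W_i - \alpha \frac{\hat{D}_{W_i}}{\sqrt{\hat{S}_{W_i}} + \epsilon}$$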
What is the main difference between gradient descent and gradient descent with
momentum?
With Stochastic Gradient Descent we don’t compute the exact derivative of our loss function.
Instead, we estimate it on a small batch, which means we’re not always going in the
optimal direction, because our derivatives are ‘noisy’.
Gradient descent with momentum smooths out the steps of gradient descent using a
moving average of the derivatives, which takes the previous results into consideration,
so it avoids oscillations and learns faster.
Mention two different ways to reduce overfitting. Explain how each of them reduces
overfitting.
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization →
$$\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_2^2$$
• L1 Regularization →
$$\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_1$$
• L 2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
Dropout Regularization
• Every epoch → shut down a random number of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
MCQ Questions:
1. Which hyperparameter of the following needs to be tuned first in a typical neural network
problem?
a. Momentum parameter.
b. Mini batch size
c. Learning rate
d. Number of hidden nodes in each layer.
2. The softmax layer is only used in
a. Binary classification problems.
b. Multiclass classification problems.
c. Regression problems.
d. All of the above.
3. As the number of hidden layers increases in a neural network,
a. The time needed to train the network decreases.
b. The network learns more complex functions and features.
c. The network converges to a local minimum of the cost function faster.
d. The network may be subject to overfitting.
4. The activation function recommended to use when working with images is
a. ReLU function.
b. Step function.
c. Sigmoid function. → Final layer of binary classification problems
d. Tanh function. → For Speeches
5. The neural network that tries to match two given inputs and detect how similar or different
they are from each other is called
a. Convolutional neural network.
b. Siamese network.
c. Recurrent neural network.
d. Generative Adversarial Network.
6. A network with a skip connection from output layer to input layer is called
a. Convolutional neural network.
b. Siamese network
c. Recurrent Neural Network.
d. Generative Adversarial Network.
7. A neural network used mainly to generate features from input images and represent these
images in a compressed low dimensional space is called:
a. Convolutional neural network.
b. Auto Encoder network.
c. Recurrent Neural Network.
d. Siamese Network.
8. If the neural network is subject to overfitting, then we can reduce the effect of overfitting by:
a. Increasing the size of the training data.
b. Increasing the size of the neural network.
c. L2- Regularization.
d. Dropout regularization.
9. A time series is composed of:
a. Trend
b. Seasonality
c. Random Noise
d. All of the above
10. For the neural network to learn functions such as XOR and XNOR, it is sufficient to have:
a. 1 input layer and 1 output layer.
b. 1 input layer, 1 hidden layer, 1 output layer.
c. 1 input layer, 2 hidden layers, 1 output layer.
d. It is dependent on the number of inputs, and so it is impossible to tell.
QUESTIONS
Define the Machine Learning Recipe
                  Underfitting    Normal         Overfitting
Training Set      Large Error     Small Error    Small Error
Validation Set    Large Error     Small Error    Large Error
What’s the key idea of Adaboost algorithm? Explain using a diagram
Tries to approximate the Bayes classifier by combining many of the best weak classifiers to
create a strong classifier
Weak Classifier: is able to guess the right class with a percentage slightly bigger than
random guessing
Strong Classifier: is usually correct > 80%
Why can’t adaboost work for more than binary classification? How would you modify
it to work for more?
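AdaBoost assumes that the error of each weak classifier is less than 0.5, because 𝛼t is
negative if the error is greater than 0.5. With k > 2 classes, a weak classifier only has to
beat random guessing (error < 1 - 1/k), so its error can exceed 0.5.
To overcome this problem, the equation to calculate 𝛼t is updated to:
$$\alpha_t = \log\frac{1 - err}{err} + \log(k - 1)$$
∴ 𝛼t is positive as long as
$$(1 - err) > \frac{1}{k}$$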
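A short numerical check of the modified weight, as a sketch. The function names are hypothetical; the formulas are the two expressions for 𝛼t discussed above.

```python
import math


def alpha_binary(err):
    """Original AdaBoost weight: log((1 - err) / err)."""
    return math.log((1 - err) / err)


def alpha_multiclass(err, k):
    """Modified weight with the extra log(k - 1) term for k classes."""
    return math.log((1 - err) / err) + math.log(k - 1)


# A weak classifier with 60% error on a 3-class problem still beats
# random guessing (1 - err = 0.4 > 1/3), so it should get a positive weight.
print(alpha_binary(0.6))         # negative
print(alpha_multiclass(0.6, 3))  # positive
```

For k = 2 the extra term is log(1) = 0, so the modified formula reduces to the original one.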
What’s the difference between wrapper method and filter method for feature selection?
Methods of Feature Selection:
• Filter type: select features without looking into the classifier you are going to use.
– Advantages → Faster Execution & Generality
– Disadvantages → Tendency to select large subsets
• Wrapper type: select features by taking into consideration the classifier you will use
(SFS & BFS)
– Advantages → Accuracy
– Disadvantages → Slow Execution & Lack of Generality
Compare between AI, ML, & DL
• AI includes any technique that mimics human behavior
• Classical AI (Excluding ML and DL)
– Rule based expert systems
– Knowledge reasoning
– No learning needed (No data is available)
– Simpler applications
• Machine Learning
– Programs learn from examples
– Improve performance based on experience
– Finding patterns in data
– Adapt to complex and changing environments
• Deep Learning
– Multi-layered neural networks
– end-to-end solution (Works on raw data)
– Requires big data and high computational power
If you train a neural network and get 54% training accuracy and 51% validation accuracy,
explain what you will do next
Both accuracies are low and close to each other, which means the neural network is
underfitting. To overcome this problem:
• Use a larger network (more nodes or more layers)
• Train for a longer time
Explain with examples the linear perceptron update rule
Widrow Hoff Rule
W(new) = W(old) + 𝜂[d(m) - y(m)]u(m)
𝜂 → Learning rate
If d(m)=y(m) then no change is needed in the weights.
To show that each iteration corrects errors:
– Let the actual class be d(m) = 1
– Suppose the neuron classification is $y(m) = f(W^T u(m)) = 0$
– So, $W^T u(m) < 0$
– However, we want y(m) = 1
– We need to correct the wrong classification by making what is inside f(∙) more positive,
which will make y(m) more likely to be 1
$$W(new) = W(old) + \eta\,[d(m) - y(m)]\,u(m)$$
$$y(new) = f\left(W(new)^T u(m)\right)$$
$$y(new) = f\left(W(old)^T u(m) + \eta\,[d(m) - y(m)]\,u^T(m)\,u(m)\right)$$
$$y(new) = f\left(W(old)^T u(m) + \eta\,[1 - 0]\,u^T(m)\,u(m)\right)$$
$$y(new) = f\left(W(old)^T u(m) + \eta\,\|u(m)\|^2\right)$$
Since 𝜂||u(m)||² > 0, the argument of f(∙) has moved in the positive direction.
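The update rule can be tried on one misclassified example. This is a minimal sketch: the weights, the input vector, and 𝜂 = 0.5 are made-up values, and f is assumed to be a hard limiter that outputs 1 when its argument is non-negative.

```python
def step(z):
    """Hard-limit activation f: 1 if z >= 0, else 0 (assumed form of f)."""
    return 1 if z >= 0 else 0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def update(w, u, d, eta=0.5):
    """One step of W(new) = W(old) + eta * [d - y] * u."""
    y = step(dot(w, u))
    return [wi + eta * (d - y) * ui for wi, ui in zip(w, u)]


w_old = [-1.0, 0.5, 0.5]   # illustrative weights (first entry acts as the bias)
u = [1.0, 1.0, 0.0]        # augmented input vector
d = 1                      # desired class

w_new = update(w_old, u, d)
print(dot(w_old, u), step(dot(w_old, u)))  # argument is negative, output 0
print(dot(w_new, u), step(dot(w_new, u)))  # argument grew by eta * ||u||^2
```

If the example is already classified correctly, d - y = 0 and the weights are returned unchanged.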
Explain the naive estimator method, write the formula used and compare it to
histogram analysis
Naive Estimator
$$P(X) = \frac{\#\text{points falling in } \left[X - \frac{h}{2},\; X + \frac{h}{2}\right]}{M h}$$
• Discontinuity of the density estimates
• All data points are weighted equally regardless of their distance to the estimation point
Histogram Analysis
$$p(x) = \frac{m}{M \cdot \text{sizeOfBin}}$$
m → number of data points falling in the bin
M → total number of points that belongs to the same class
• Weak method of estimation
• Discontinuity of these density estimates, even though the true densities are assumed
to be smooth
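Both estimators can be compared on a tiny 1-D data set. This is a sketch with made-up data; the `origin` argument (where the first bin starts) is an assumption needed to place the histogram bins.

```python
import math


def naive_estimate(x, data, h):
    """Count points in [x - h/2, x + h/2] and divide by M * h."""
    count = sum(1 for p in data if x - h / 2 <= p <= x + h / 2)
    return count / (len(data) * h)


def histogram_estimate(x, data, origin, bin_size):
    """Count points in the fixed bin that contains x, divide by M * bin size."""
    index = math.floor((x - origin) / bin_size)
    low = origin + index * bin_size
    m = sum(1 for p in data if low <= p < low + bin_size)
    return m / (len(data) * bin_size)


data = [1.0, 1.2, 1.9, 2.1, 3.5]   # illustrative samples of one class

for x in (1.9, 2.0):
    print(x, naive_estimate(x, data, 1.0), histogram_estimate(x, data, 0.0, 1.0))
```

The naive estimator centers its window on the query point, so nearby queries give similar values, while the histogram value jumps when the query crosses a bin edge. Both remain discontinuous.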
Explain the kernel density estimation technique with multidimensional case equations.
To apply the kernel density estimation technique, we need the following:
1. Bump Function (g(x))
• We can model the bump function as follows in the multidimensional case
2. Choosing Optimal h (diagonal bandwidth matrix)
• We get H i in each dimension with the normal reference rule:
Then we get the average and use it as our h opt
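The equations for this answer are missing from these notes, so the sketch below relies on assumptions: a standard Gaussian bump function, the usual multidimensional form of the normal reference rule, h_i = σ_i (4 / ((N + 2) M))^(1/(N + 4)), and a single averaged h as described above. The lecture may use a different bump function or constant.

```python
import math


def normal_reference_h(data):
    """Per-dimension bandwidth from the normal reference rule, then averaged."""
    M, N = len(data), len(data[0])
    factor = (4.0 / ((N + 2) * M)) ** (1.0 / (N + 4))
    hs = []
    for i in range(N):
        column = [p[i] for p in data]
        mean = sum(column) / M
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / (M - 1))
        hs.append(factor * std)
    return sum(hs) / N


def gaussian_bump(z):
    """Standard N-dimensional Gaussian bump g(z) (assumed choice of g)."""
    N = len(z)
    return math.exp(-0.5 * sum(v * v for v in z)) / (2 * math.pi) ** (N / 2)


def kde(x, data, h):
    """p(x) = 1 / (M h^N) * sum over m of g((x - x(m)) / h)."""
    M, N = len(data), len(x)
    total = sum(
        gaussian_bump([(xi - pi) / h for xi, pi in zip(x, p)]) for p in data
    )
    return total / (M * h ** N)


points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # made-up 2-D data
h_opt = normal_reference_h(points)
print(h_opt, kde([1.0, 1.0], points, h_opt))
```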
Explain why relu is used instead of sigmoid
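• The gradient of ReLU is constant (equal to 1) for positive inputs, which results in faster
learning than sigmoid
• The gradient is less likely to vanish: the sigmoid gradient is at most 0.25 and shrinks
toward zero for large |x|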
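The vanishing effect can be illustrated by multiplying the activation derivative across layers. This is a simplified sketch that ignores the weights and assumes every layer sees the same pre-activation value.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, layers):
    """Product of `layers` identical activation derivatives (weights ignored)."""
    return grad_fn(z) ** layers


# Even at z = 0, where the sigmoid gradient is largest (0.25),
# ten layers shrink the gradient below one millionth.
print(chained_gradient(sigmoid_grad, 0.0, 10))
print(chained_gradient(relu_grad, 1.0, 10))
```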
Explain the least square classifier. No need for proof
Least Square Classifier
• If the problem is not linearly separable, then the perceptron algorithm will not converge
& will keep cycling forever (Rosenblatt theorem), which is why we need the least square
classifier
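• Define an error function
$$E = \sum_{m=1}^{M} \left(W^T u(m) - b_m\right)^2$$
It measures how close the obtained solution is to the desired one.
• We then seek to minimize the error function by finding the W that minimizes E
• Define the gradient vector
$$\frac{\partial E}{\partial W} = \begin{bmatrix} \frac{\partial E}{\partial W_0} \\ \frac{\partial E}{\partial W_1} \\ \vdots \\ \frac{\partial E}{\partial W_N} \end{bmatrix}$$
– Set $\frac{\partial E}{\partial W} = 0$ and solve for W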
– Advantages:
* Can converge even if the problem is not linearly separable
– Disadvantages:
* Linear classifiers don't solve all problems
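Setting the gradient to zero gives the linear system (UᵀU) W = Uᵀb, where the rows of U are the augmented vectors u(m). The sketch below solves it with plain Gaussian elimination on a made-up, non-separable 1-D data set; the helper names are hypothetical and a unique solution is assumed.

```python
def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def least_squares_weights(U, b):
    """W minimizing E = sum_m (W^T u(m) - b_m)^2, via (U^T U) W = U^T b."""
    n = len(U[0])
    UtU = [[sum(u[i] * u[j] for u in U) for j in range(n)] for i in range(n)]
    Utb = [sum(u[i] * bm for u, bm in zip(U, b)) for i in range(n)]
    return solve(UtU, Utb)


# Augmented inputs [1, x]; the labels alternate, so no threshold separates them.
U = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [-1.0, 1.0, -1.0, 1.0]
W = least_squares_weights(U, b)
print(W)
```

A solution is obtained in one solve even though the classes overlap, but the resulting linear boundary still misclassifies some points.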
What are the main issues in GMM?
1. Initialization
• Expectation–Maximization (EM) is an iterative algorithm which is very sensitive to initial
conditions. (Start from trash → end up with trash)
• Usually we use K-Means to get a good initialization.
2. Determining Number of Gaussian Components
• Use information-theoretic criteria to obtain the optimal K
• Methods:
o MDL (Minimum Description Length)
o AIC (Akaike’s Information Criterion)
o BIC (Bayesian Information Criterion)
o MML (Minimum Message Length)
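Two of these criteria have simple closed forms: AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, where p is the number of free parameters, n the number of samples, and ln L the log-likelihood reached by EM. The sketch below assumes full covariance matrices when counting p; the function names are hypothetical.

```python
import math


def gmm_num_params(K, d):
    """Free parameters of a K-component, d-dimensional GMM (full covariances)."""
    weights = K - 1                   # mixing weights sum to 1
    means = K * d
    covariances = K * d * (d + 1) // 2
    return weights + means + covariances


def aic(log_likelihood, p):
    return 2 * p - 2 * log_likelihood


def bic(log_likelihood, p, n):
    return p * math.log(n) - 2 * log_likelihood


print(gmm_num_params(2, 1))
```

In practice one fits the GMM for several values of K and keeps the K with the lowest criterion value.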
Types of features to be removed in feature selection method
Types of Features to be Removed
- Irrelevant Features
- Correlated Features:
– Features that vary very closely with other features (Ex: #corners & #lines)
Explain regularization and why it is used
Regularization is used to avoid overfitting
Regularization
• Used to prevent overfitting
• Intuition: set the weights of some hidden nodes to zero to simplify the network
• L2 Regularization →
$$\frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\,\|W\|_2^2$$
• L1 Regularization →
$$\frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\,\|W\|_1$$
• L2 regularization is used more often
• 𝜆 is the regularization parameter (hyper parameter)
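The two regularized costs can be computed directly. This is a minimal sketch with made-up per-example errors and weights; the function name is hypothetical.

```python
def regularized_cost(errors, weights, lam, kind="l2"):
    """Average error over M examples plus lam / (2M) times the weight penalty."""
    M = len(errors)
    data_term = sum(errors) / M
    if kind == "l2":
        penalty = sum(w * w for w in weights)      # squared L2 norm
    else:
        penalty = sum(abs(w) for w in weights)     # L1 norm
    return data_term + lam / (2 * M) * penalty


errors = [1.0, 3.0]      # illustrative per-example errors E_m
weights = [3.0, -4.0]    # illustrative weights

print(regularized_cost(errors, weights, lam=2.0, kind="l2"))
print(regularized_cost(errors, weights, lam=2.0, kind="l1"))
```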
Dropout Regularization
• Every epoch → shut down a random subset of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler Model
• More generalized
Input & Output Regularization
• Inputs have to be approximately in the range of 0 to 1 or -1 to 1
Explain how bayes rule can be used in classification
Given a pattern X (with unknown class) that we wish to classify
• Compute P(C_1|X), P(C_2|X), … , P(C_K|X)
• Find the k giving maximum P(C_k|X)
• P(C_k|X) → posterior probability
• P(C_k) → a priori probability
• P(X|C_k) → class-conditional density
$$P(C_k|X) = \frac{P(X|C_k)\,P(C_k)}{P(X)}$$
This is our classification according to the Bayes classification rule.
It is an optimal classification rule: it chooses the most likely class, so no other rule can
achieve a lower probability of error.
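The rule can be sketched for two classes. The class-conditional densities and priors below are made-up numbers, and P(X) is computed as the sum of P(X|C_k) P(C_k) over the classes.

```python
def bayes_classify(likelihoods, priors):
    """Return (index of the most probable class, list of posteriors)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                       # P(X)
    posteriors = [j / evidence for j in joint]  # P(C_k | X)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors


# X is three times more likely under class 1, but class 0 is far more common.
best, posteriors = bayes_classify(likelihoods=[0.2, 0.6], priors=[0.9, 0.1])
print(best, posteriors)
```

Here the prior outweighs the likelihood, so the pattern is assigned to class 0.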
Explain the importance of deseasonalization using diagrams.
It removes the seasonal periodicities to predict trends correctly
After trend prediction, seasonality can be recovered via multiplication by the corresponding
seasonal average
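A small sketch of the idea, assuming multiplicative seasonality. Each seasonal average is normalized by the overall mean here, which is an assumption of this example; the series is made up.

```python
def seasonal_indices(series, period):
    """Average of each season divided by the overall average."""
    overall = sum(series) / len(series)
    indices = []
    for s in range(period):
        values = series[s::period]
        indices.append((sum(values) / len(values)) / overall)
    return indices


def deseasonalize(series, indices):
    p = len(indices)
    return [v / indices[t % p] for t, v in enumerate(series)]


def reseasonalize(series, indices):
    """Recover seasonality by multiplying by the matching seasonal index."""
    p = len(indices)
    return [v * indices[t % p] for t, v in enumerate(series)]


series = [10.0, 20.0, 10.0, 20.0]       # period-2 pattern, no trend
indices = seasonal_indices(series, 2)
flat = deseasonalize(series, indices)   # the periodic swing is removed
print(flat)
print(reseasonalize(flat, indices))
```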
Explain steps of back propagation, and state its disadvantages.
Steps:
1- Initialize the weights and thresholds (biases) randomly in [-1, 1]
2- Present the augmented input (or feature) vector of the m-th training example u(m) and its
corresponding desired output d(m)
3- For m = 1 to M:
• Present u(m) to the network and compute the hidden layer outputs and final layer
outputs
• Use these outputs in a backward scheme to compute the partial derivatives of the error
function w.r.t. the weights of each layer
• Update the weights →
$$W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - \alpha\,\frac{\partial E_m}{\partial W_{i,j}^{[L]}}$$
4- Compute the total error and stop in case of convergence
Disadvantages of Back Propagation
Can often be slow in reaching the minimum
• Too small 𝛼 → very small steps, so the minimum is reached slowly
• Too large 𝛼 → leads to oscillations and possibly no convergence at all
• Prone to getting stuck in local minima
To overcome the disadvantages:
• For the step size problem, use a variable 𝛼 (start with a large value, then decrease it) →
a good range is between 0.001 and 0.05
• For the local minima problem, repeat the training several times, each time with a
different set of initial weights
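The effect of the step size can be seen on a toy error function E(w) = w², whose gradient is 2w. This is only an illustration: the 𝛼 values that work for this toy function are not the ones that suit a real network.

```python
def gradient_descent(alpha, w0=1.0, steps=20):
    """Run `steps` updates of w = w - alpha * dE/dw on E(w) = w^2."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w


print(gradient_descent(0.001))  # too small: barely moved from w0 = 1
print(gradient_descent(0.4))    # reasonable: very close to the minimum at 0
print(gradient_descent(1.1))    # too large: oscillates with growing magnitude
```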
Explain the three methods used to update weights.
Sequential (Stochastic) Gradient Descent (SGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: M (M: # examples):
        1. Forward Propagation
        2. Compute the loss function.
        3. Backward Propagation.
        4. Update the weights and biases.
Advantages:
• Faster in update compared to gradient descent
Disadvantages:
• Hard to converge since it depends on every single example
• Loses the speedup from vectorization
Batch Gradient Descent (BGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: M (M: # examples):
        1. Forward Propagation
        2. Accumulate to the loss
    3. Backward Propagation (with averaging over M)
    4. Update the weights and biases (once per epoch).
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (too long per iteration)
Mini-Batch Gradient Descent (MBGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training Loop (epochs)
    for j = 1: minibatches
        for k = 1: M (M: # examples in the jth minibatch):
            1. Forward Propagation
            2. Accumulate to the loss
        3. Backward Propagation (with averaging over M)
        4. Update the weights and biases (once per minibatch).
Advantages:
• Fast (more frequent updates than BGD, while keeping the speedup from vectorization)
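The three methods differ only in how many examples feed each update, which the sketch below exposes through `batch_size`. The model is a hypothetical one-weight, one-bias linear unit with squared error, the data are made up, and examples are not shuffled, to keep the example short.

```python
def train(xs, ys, batch_size, alpha=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    M = len(xs)
    for _ in range(epochs):                       # training loop (epochs)
        for start in range(0, M, batch_size):     # loop over minibatches
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            grad_w = grad_b = 0.0
            for x, y in zip(batch_x, batch_y):
                err = (w * x + b) - y             # forward pass and error
                grad_w += 2 * err * x / n         # gradient averaged over batch
                grad_b += 2 * err / n
            w -= alpha * grad_w                   # one update per minibatch
            b -= alpha * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                         # generated from y = 2x + 1

print(train(xs, ys, batch_size=1))        # SGD: update after every example
print(train(xs, ys, batch_size=len(xs)))  # BGD: one update per epoch
print(train(xs, ys, batch_size=2))        # mini-batch
```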
Describe how CNNs work, and why they have a smaller memory footprint.
They are mostly applied to image problems
- Layers extract features from input images:
- Convolution layer, i.e., filtering
- Pooling layer, i.e., reducing the input size (avg or max)
- Fully connected layer, i.e., as in a multi-layer NN, at the final layers
Why a smaller memory footprint?
- Parameter sharing (compared to fully connected layers)
- Sparsity of connections (a pixel in the next layer is not connected to all pixels of the
previous layer, e.g., all 100 of them, but only to those under the filter)
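The saving can be quantified by counting parameters. This sketch assumes a 10×10 single-channel input (100 pixels) mapped to an 8×8 output, once with a fully connected layer and once with a single 3×3 filter; the sizes are chosen for illustration.

```python
def fc_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair plus biases."""
    return n_in * n_out + n_out


def conv_params(kernel, channels_in, channels_out):
    """Convolution layer: each filter is shared across all image positions."""
    return kernel * kernel * channels_in * channels_out + channels_out


print(fc_params(10 * 10, 8 * 8))   # 6464 parameters
print(conv_params(3, 1, 1))        # 10 parameters
```

Each output pixel of the convolution depends on only 9 input pixels and reuses the same 9 weights, which is where both the sparsity and the sharing come from.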
Discuss 3 limitations of Deep Learning.
1. Not a magic tool!
• Lack of adaptability and generality compared to human vision system
• Not able to build general intelligent machine
2. Can’t fit all real-world scenarios
• Infinite Variables
3. Large amounts of labeled data can lead to impressive achievements in supervised
learning, but collecting them has a cost:
• Expensive!
• Sometimes experts & special equipment are needed
4. Datasets may be biased
• Deep Networks become biased against rare patterns
• Serious consequences in some real-world applications (e.g., medical, automotive, …
etc.)
• Classification may be sensitive to viewpoint if one of the viewpoints is
under-represented
• Solution: Researchers should consider synthetic generation of data to mitigate the
unbalanced representation of data
5. Sensitive to standard adversarial attacks
• Datasets are finite and just represent a fraction of all possible images
• Solution: Add extra training, i.e., “adversarial training”
6. Over sensitive to changes in context
• Limited number of contexts in dataset, i.e., monkey in jungle
• Combinatorial Explosion!
7. Combinatorial Explosion
• The set of real-world images is combinatorially large
• Application dependent (e.g., medical imaging is an exception)
• Considering compositionality may be a potential solution
• Testing is challenging (consider worst case scenarios)
8. Visual understanding is tricky
• Mirrors
• Sparse Information
• Physics
• Humor
9. Unintended results from fitness function
Explain the K-folds validation method. In what context is it used
Used for parameter tuning over the training dataset.
In this technique, the parameter K refers to the number of different subsets that the given
data set is split into. K-1 subsets are used to train the model and the left-out subset is
used as a validation set.
Steps involved in the K-fold cross-validation:
1. Split the data set into K subsets randomly
2. For each one of the K subsets:
   a. Treat that subset as the validation set
   b. Use all the remaining subsets for training
   c. Train the model and evaluate it on the validation set
   d. Calculate the prediction error
3. Repeat step 2 K times, i.e., until the model has been trained and tested on all subsets
4. Generate the overall prediction error by averaging the prediction errors of the K runs
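The steps can be sketched with the standard library only. `train_and_score` is a hypothetical callback that trains a model on the training part and returns its prediction error on the validation part; the fixed `seed` is an assumption that makes the random split repeatable.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(data, k, train_and_score, seed=0):
    """Average the validation error over the k train/validation splits."""
    folds = k_fold_indices(len(data), k, seed)
    errors = []
    for i, val_idx in enumerate(folds):
        train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
        val = [data[j] for j in val_idx]
        errors.append(train_and_score(train, val))
    return sum(errors) / k


# Dummy callback for demonstration: reports the validation set size as "error".
data = list(range(10))
print(cross_validate(data, 5, lambda train, val: float(len(val))))
```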