Lecture 1 (Introduction)

• AI includes any technique that mimics human behavior
• Classical AI (excluding ML and DL)
  – Rule-based expert systems
  – Knowledge reasoning
  – No learning needed (no data is available)
  – Simpler applications
• Machine Learning (statistical techniques)
  – Programs learn from examples
  – Improve performance based on experience
  – Find patterns in data
  – Adapt to complex and changing environments
• Deep Learning
  – Multi-layered neural networks
  – End-to-end solution (works on raw data)
  – Requires big data and high computational power

When to use Deep Learning?
1- A large amount of data is available (expensive)
2- High computational power is available (expensive)
3- Lack of domain understanding
4- Complex problems
Note: in deep learning, performance keeps improving as the amount of data grows; this is not the case for older learning techniques.

Limitations of Deep Learning?
1- Lack of adaptability and generality compared to human vision
2- A large amount of labeled data is needed (expensive, and sometimes requires experts)
3- Datasets may be biased against rare patterns
4- Sensitive to standard adversarial attacks
5- Over-sensitive to changes in context
6- Combinatorial explosion
7- Visual understanding is tricky (mirrors, physics, humor)

Main Types of Learning?

Supervised Learning
• Depends on labeled examples
• Learns the unknown function that maps inputs to outputs
• Error = f(target outputs - actual outputs)
• Models p(y|x)
Steps:
1- Raw data
2- Pre-processing
3- Feature extraction
4- Training a classifier

Unsupervised Learning
• Learns a generative model of the input data that tries to describe possible patterns and features of these inputs without the need for labeled outputs.
• Types:
  – Clustering problems: given unlabeled data, group data points with similar features together
  – Association problems: find dependencies within the data and generate dependency rules for better decisions
• Models p(x)

Steps of K-means (Clustering):
1- Select K points as the centers of K clusters
2- Assign each remaining point to the cluster of its nearest center
3- Calculate the mean of each cluster
4- Repeat the clustering based on the means until there is no change
5- Compute the sum of variations of each cluster (sum of squared differences)
6- Repeat several times starting from step 1
7- Take the cluster centers with the minimum variation as the best option for K clusters
8- Repeat the above algorithm starting with K = 1 and keep increasing K
9- Draw an elbow plot of the reduction in variation per K
10- Select the elbow of the plot as your K

Semi-Supervised Learning (Maximization problem)
• Few labeled data
• Lots of unlabeled data
• Makes use of the unlabeled data with the help of the labeled data
Steps:
1- Build a supervised learning model based on the labeled data
2- Expectation step: label the unlabeled data using the model built in step 1
3- Maximization step: retrain the model on all the data
4- Repeat from step 2 until convergence

Reinforcement Learning
• Training info = evaluations (rewards/penalties)
• The objective of the agent is to collect as much reward as possible

What is a pattern?
• A pattern is an abstract object, such as a set of measurements describing a physical object

Why is object recognition a hard problem?
• Due to the variability of objects
• The same object can look different (rotation, size, camera angle)

Classification VS Regression?
• Both problems learn an unknown function that maps inputs to outputs (function approximation)
• Classification → Predict a label
• Regression → Predict a quantity

After collecting data and choosing the features that can effectively differentiate between the classes, a density function is estimated.
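The K-means steps listed earlier in this lecture can be sketched in a few lines. This is only a minimal illustration, not a reference implementation: the function names (`kmeans`, `sq_dist`, `best_of_restarts`) and the parameters `iters`, `seed`, and `restarts` are assumptions made for the example, and points are assumed to be tuples of numbers.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """One K-means run (steps 1-5). Returns (centers, labels, variation)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick K points as initial centers
    labels = None
    for _ in range(iters):
        # step 2: assign every point to its nearest center
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # step 4: stop when nothing changes
            break
        labels = new_labels
        # step 3: move each center to the mean of its cluster
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # step 5: total within-cluster sum of squared differences
    variation = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, variation


def best_of_restarts(points, k, restarts=10):
    """Steps 6-7: rerun from different starts and keep the lowest variation."""
    return min((kmeans(points, k, seed=s) for s in range(restarts)),
               key=lambda result: result[2])
```

For steps 8-10, one would call `best_of_restarts(points, k)` for k = 1, 2, 3, ... and plot the returned variation against k to look for the elbow.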
• It is very difficult to design a classifier that yields no classification errors
• Classification errors occur due to the variability of the patterns typically encountered
• Using multiple features reduces the classification errors

Lecture 2 (Training Patterns)

Feature Vector
$X(m) = [X_1(m), X_2(m), \dots, X_N(m)]^T$
N → Number of features
m → Index of the m-th training pattern
M → Number of training patterns

Decision Regions
• The data points of each class occur in groupings or clusters in the feature-space plot
The equation of the linear decision boundary:
$W_0 + W_1 X_1 + \dots + W_N X_N = 0$
• $W_0$ and $W_i$ are the constants that determine the position of the hyperplane

Types of Problems
A problem is said to be linearly separable if there is a hyperplane that can separate the training data points of class $C_1$ from those of class $C_2$

Lecture 3 (Pattern Classification Methods)

Minimum Distance Classifier
Steps:
• Choose a center or representative pattern V(k) for each class
• Take a pattern X that we would like to classify
• Compute the Euclidean distance from X to each center V(k); in two dimensions:
$d^2 = (y_2 - y_1)^2 + (x_2 - x_1)^2$
• Find the index of the minimum distance; the corresponding class is the classification of X
• To calculate the class center from the $M_i$ training patterns of class i:
$V(i) = \frac{1}{M_i} \sum_{m=1}^{M_i} X(m)$
Disadvantages:
• Too simple to solve difficult problems
• Outliers can badly shift the position of the class mean
• Poor performance if the classes overlap

Nearest Neighbor Classifier (NN)
The class of the pattern nearest to X determines its classification
Steps:
• Compute the distance between pattern X and each pattern X(m) in the training set
• The class of the pattern m that corresponds to the minimum distance is chosen as the classification of X
Advantages:
• Simplicity
Disadvantages:
• Sensitive to outliers
• Large overlaps between the classes can negatively affect performance

K-Nearest Neighbor Classifier (KNN) ** (if the number of classes is odd → K = ??)
Same as NN, but takes the k nearest points into consideration
Less dependent on strange patterns than the nearest-neighbour classification rule
Disadvantages:
• The neighbours could be fairly far away from X, so the information used might not be relevant

Lecture 4 (Bayes Classification Rule)

→ It assigns the data point to the most likely class, which makes it an optimal classification rule
→ However, the Bayes classifier assumes that the probability densities are known, which is usually not the case
→ Note: the a priori probabilities represent the frequencies of the classes irrespective of the observed features (we calculate them from the training data labels)
→ With Gaussian class-conditional densities, the Bayes classifier is a linear classifier if the covariance matrices of all classes are the same

Steps:
Given a pattern X (with unknown class) that we wish to classify
• Compute $P(C_1|X), P(C_2|X), \dots, P(C_K|X)$ (we cannot compute these directly, so we use Bayes rule)
• Find the k giving the maximum $P(C_k|X)$

Rule:
$P(C_i|X) = \frac{P(C_i, X)}{P(X)} = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
$P(C_k|X)$ → Posterior probability
$P(C_k)$ → A priori probability
$P(X|C_k)$ → Class-conditional density (class-conditional probability function)

Total probability law (marginalization):
$P(X) = \sum_i P(X|C_i)\,P(C_i)$
In practice P(X) does not have to be computed, as it is a common factor

P(correct): [equation (1), missing from the source]
P(error): $P(\text{error}) = 1 - P(\text{correct})$

To get the decision boundary:
$P(X|C_1)\,P(C_1) = P(X|C_2)\,P(C_2)$

Gaussian Class-conditional Densities

1-D
$\mu_i = E(X) = \int X \, \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X-\mu_i)^2}{2\sigma^2}} \, dX$
Variance $= E\left[(X-\mu_i)^2\right] = \sigma^2$
$P(X|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(X-\mu_i)^2}{2\sigma^2}}$

Multi-dimensions, independent case
$P(X|C_i) = \frac{1}{(2\pi)^{N/2}\,\sigma_1 \sigma_2 \cdots \sigma_N} \exp\left(-\sum_{j=1}^{N} \frac{(X_j-\mu_j)^2}{2\sigma_j^2}\right)$
(all entries of the covariance matrix are zero except the main diagonal)

Multi-dimensions, dependent case
$P(X|C_j) = \frac{1}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma)} \exp\left(-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)\right)$

Lecture 5 Density Estimation

Probability densities have to be estimated in order to apply Bayes rule

Histogram Analysis
$p(x) = \frac{m}{M \times \text{sizeOfBin}}$
m → number of data points falling in the bin
M → total number of points that belong to the same class
• Weak method of estimation
• The density estimates are discontinuous, even though the true densities are assumed to be smooth

Naive Estimator (more accurate than histogram analysis)
$P(X) = \frac{\#\text{points falling in } [X - \frac{h}{2},\, X + \frac{h}{2}]}{M h}$
• The density estimates are discontinuous
• All data points are weighted equally regardless of their distance to the estimation point

Kernel Density Estimator (Parzen Window)
Steps:
• Choose a bump function
• Apply the bump function to each point
• Sum all bump functions
Choose the bump function as a Gaussian with standard deviation (bandwidth) h:
$\phi_h(x) = \frac{1}{\sqrt{2\pi}\,h} e^{-\frac{x^2}{2h^2}}$
$P(X|C_i) = \frac{1}{M} \sum_{m=1}^{M} \phi_h(X - X(m))$
• $\phi_h$ does not have to be Gaussian; in general $\phi_h(X) = \frac{1}{h}\, g\!\left(\frac{X}{h}\right)$
• g(.) should integrate to 1 and be a known distribution
• Check the Parzen window estimator as a bump function
• Note:
  – The naive estimator is equivalent to a Parzen window estimator with:
    $g(x) = 1$ for $-0.5 < x < 0.5$, and $g(x) = 0$ otherwise
• Multi-dimension form:

How to choose h
$H_i = \sigma_i \left[\frac{4}{(N+2)M}\right]^{\frac{1}{N+4}}$
$\sigma_i = \sqrt{[\Sigma_x]_{i,i}}$
$\Sigma_x = \frac{1}{M} \sum_{m=1}^{M} (X(m)-\mu)(X(m)-\mu)^T$
$h_{opt} = \frac{1}{N} \sum_{i=1}^{N} H_i$

Lecture 6 Feature Selection & Extraction

• If we use too many features we will suffer from the curse of dimensionality
• Generalization ability: the ability of the classifier to perform well on data it has not seen before

Curse of Dimensionality:
• The parameter estimates (mean vector and covariance matrix) will not be very accurate
• The data points become scattered and the clustering of classes is no longer clear

Types of Features to be Removed
- Irrelevant features
- Correlated features:
  – Features that vary very closely with other features (Ex: #corners & #lines)

Feature Selection
• From the available N features choose a subset of L (≪ N)

Sequential Forward Selection Algorithm (SFS):
1- Examine each feature $X_i$ taken alone, design the classifier, and select the best feature
2- From the remaining features, select the one that, together with the already selected group, gives the best performance
3- Continue in this manner until you have selected L features

Sequential Backward Selection Algorithm (SBS):
1- Start with all features selected. Remove one feature at a time, design the classifier and compute Pc
2- The feature we end up removing is the one whose removal results in the least reduction of Pc
3- Keep removing until only L selected features remain

Methods of Feature Selection:
• Filter type: select features without looking into the classifier you are going to use.
  – Advantages → Faster execution & generality
  – Disadvantages → Tendency to select large subsets
• Wrapper type: select features by taking into consideration the classifier you will use (SFS & SBS)
  – Advantages → Accuracy
  – Disadvantages → Slow execution & lack of generality

Feature Extraction
• Transform the available N features into a smaller number of L features through certain transformations (usually linear)
• Transformed features may have no physical meaning → explaining the model is problematic
• May not be suitable for every domain

Principal Component Analysis (PCA):
• In PCA, the most interesting directions are those having large variations (large variance)
Steps:
- Transform the data to zero mean ($Y = X - \mu$)
- Estimate the covariance matrix from the data:
  $\Sigma = \frac{1}{M} \sum_{m=1}^{M} Y(m)\,Y(m)^T$
- Compute the eigenvalues and eigenvectors (principal components) of $\Sigma$: solve $|\Sigma - \lambda I| = 0$ for $\lambda$, then $(\Sigma - \lambda I)\,u = 0$ for each eigenvector u
- Choose the eigenvectors corresponding to the largest eigenvalues
- Transform the data to the new space:
  $Z = A\,Y$, where $A = U^T = [u_1^T;\ u_2^T;\ \dots;\ u_L^T]$
  $\mathrm{Cov}(Z) = A \Sigma A^T = U^T \Sigma U = \Omega$, because $\Sigma U = U \Omega$

To choose L (the new number of features):
• Pick the largest 4-5 eigenvalues, sum them, then divide by the sum of all eigenvalues
• This gives the accuracy (the fraction of the total variance that is retained)
• Repeat, adding one more eigenvalue to the numerator sum each time, until the desired accuracy (99%) is reached

Lecture 7 Classifiers Combinations

Why combine classifiers?
The best classifier on the training set might not be the best on the test set, so it is risky to pick the single best classifier on the training set. As a result, it is better to combine classifiers with high diversity.
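As a toy illustration of why diversity matters, the sketch below combines three classifiers by majority vote, one common combination rule. The prediction lists are made up for the example (hypothetical classifiers `clf_a`, `clf_b`, `clf_c`), and the helper names `majority_vote` and `accuracy` are assumptions, not part of the lecture.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one label list per classifier, all of the same length."""
    combined = []
    for votes in zip(*predictions):  # votes of all classifiers for one sample
        # most_common(1) returns the label with the highest vote count;
        # with three binary voters there can be no tie
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


truth = [1, 1, 1, 0, 0, 0]
# Each hypothetical classifier is wrong on two *different* samples.
clf_a = [0, 0, 1, 0, 0, 0]
clf_b = [1, 1, 0, 1, 0, 0]
clf_c = [1, 1, 1, 0, 1, 1]

combined = majority_vote([clf_a, clf_b, clf_c])
print(accuracy(clf_a, truth), accuracy(combined, truth))
```

Because the three classifiers make their mistakes on different samples, every mistake is outvoted by the other two. If all three erred on the same samples, voting would not help.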
Ways to combine
• Majority vote
• Average the class posterior probabilities
• Average some sort of score function
• Can also use the median instead of the average

AdaBoost
• Iterative procedure
• Tries to approximate the Bayes classifier by combining many of the best weak classifiers to create a strong classifier
• AdaBoost works well only in the case of binary classification
• AdaBoost assumes that the error of each weak classifier is less than 0.5
  – $\alpha_t$ is negative if the error is greater than 0.5 (the log of a value less than 1 is negative)
  – The error of random guessing in the case of two classes is 0.5
  – In the case of K classes the random guessing error rate is $\frac{K-1}{K}$

Multi-Class AdaBoost
$\alpha_t = \log\frac{1 - err}{err} + \log(K - 1)$
∴ $\alpha_t$ is positive only if $(1 - err) > \frac{1}{K}$

Weak classifier: able to guess the right class with a percentage slightly better than random guessing
Strong classifier: usually correct more than 80% of the time

Steps:
1- Initialize the weights of the training examples: $w_m = \frac{1}{M}$
2- For t = 1 to T:
  • Select the classifier $h_t$ that best fits the training data, using the weights $w_m$ of the training examples
  • Compute the error of $h_t$: $err_t = \frac{\sum_{m=1}^{M} w_m\, I(c_m \neq h_t(x_m))}{\sum_{m=1}^{M} w_m}$
  • Compute the weight of the classifier: $\alpha_t = \log\frac{1 - err_t}{err_t}$
  • Update the weights of the wrongly classified examples: $w_m = w_m\, e^{\alpha_t I(c_m \neq h_t(x_m))}$
  • Renormalize the weights $w_m$
3- Output: $C(x) = \arg\max_k \sum_t \alpha_t\, I(h_t(x) = k)$

Overfitting:
• AdaBoost is robust to overfitting, given that we select the best weak classifiers
• However, relying on complex classifiers makes it more prone to overfitting

Lecture 8 GMM

Why GMM?
• When the data set is small, it is not possible to estimate the class-conditional densities with a kernel density estimator. Instead, we model each class-conditional density as a sum of multivariate Gaussian densities.
$P(X) = \sum_{j=1}^{K} w_j \frac{1}{(2\pi)^{N/2}\,\det^{1/2}(\Sigma_j)} \exp\left(-\frac{1}{2}(X-\mu_j)^T \Sigma_j^{-1} (X-\mu_j)\right) = \sum_{j=1}^{K} w_j\, \mathcal{N}(X;\, \mu_j, \Sigma_j)$
$\sum_{j=1}^{K} w_j = 1$

Issues with GMM
1. Initialization
• Expectation-Maximization (EM) is an iterative algorithm that is very sensitive to initial conditions.
(Start from trash → end up with trash)
• Usually we use K-means to get a good initialization.
2. Determining the Number of Gaussian Components
• 1 < K < M // K = number of Gaussians
• Use information-theoretic criteria to obtain the optimal K
• Methods:
  o MDL (Minimum Description Length)
  o AIC (Akaike's Information Criterion)
  o BIC (Bayesian Information Criterion)
  o MML (Minimum Message Length)

Lecture 9 Decision Trees & Random Forests

Building a Decision Tree
split(node, {training examples of that node}):
  1- X <- get the best attribute to split the examples
  2- For each value of X, create a child node
  3- Split the examples among the child nodes
     if (subset of examples is pure): stop()
     else: split(childNode, {subset of examples})

Which attribute to choose for splitting?
The split should produce heavily biased subsets (decrease the uncertainty), so the purity of the split has to be taken into consideration

Entropy
$H(S) = -p_{yes} \lg(p_{yes}) - p_{no} \lg(p_{no})$
H(S) → Entropy of the example subset S
$p_{yes}$ → fraction of "yes" examples within subset S
$p_{no}$ → fraction of "no" examples within subset S

Information Gain
$Gain(S, X) = H(S) - \sum_{v \in values(X)} \frac{|S_v|}{|S|} H(S_v)$

Overfitting
If we keep splitting the decision tree until all training examples are correctly classified, all leaf nodes will be pure, even if they contain just one example (singletons). This causes overfitting, and the model cannot generalize to new data.
To avoid overfitting:
Way 1: Stop splitting when the split is not statistically significant
Way 2 (post-pruning, the better way):
1- For each node (ignoring leaf nodes):
   i. Consider removing the node and all its children
   ii. Measure the performance on the validation set
2- Remove the node whose removal leads to the best improvement
3- Repeat until further removals are harmful
NB: This is a greedy approach and is not optimal

Gain Ratio
$SplitEntropy(S, X) = -\sum_{v \in values(X)} \frac{|S_v|}{|S|} \lg\frac{|S_v|}{|S|}$
$GainRatio(S, X) = \frac{Gain(S, X)}{SplitEntropy(S, X)}$

Continuous Attributes
• Continuous attributes can be used for splitting repeatedly, unlike discrete attributes
• The real values of the attribute are sorted, and the average of each two adjacent examples is a threshold to be considered

Entropy in multi-class classification
$H(S) = -\sum_i p_i \lg(p_i)$

Regression:
• Predicted output → average of the training examples in the subset (or linear regression at the leaves)
• Minimize the variance in the subsets (instead of maximizing the gain)

Pros & Cons
Pros
• Interpretable (not a black box)
• Easily handles irrelevant attributes (Gain = 0)
• Can handle missing data
• Very compact
• Very fast at testing time (O(depth))
Cons
• Greedy, so it may not find the best tree
• Only axis-aligned splits of the data (continuous data)

Uniform Sampling with Replacement
Uniform sampling means that the selection probability of any example in S is $\frac{1}{|S|}$
Replacement means that a selected item can be selected again multiple times for the same subset

Random Forest
Steps:
• M is the number of training examples
• Uniformly sample T subsets (each of size M) with replacement
• Build T decision trees with zero training error
• Take the average/vote of the T trees
Building a tree:
• D is the number of features
• Sample K features randomly (K < D), typically $K = \lceil \sqrt{D} \rceil$
• Only split on these K random features
• A new set of K features is sampled for every single split
Notes:
- Different trees are built, so each tree makes different mistakes
- No need to tune hyperparameters
- No need to pre-process or scale the inputs
- Increase T as much as you can afford (parallel processing)
- No need for a training/validation split
- The test error can be estimated directly from the training set
- Second-best approach
- Not suitable for raw images
- Improvement: prune the last split of the trees to decrease the size of the trees and decrease noise
- Compute the error for each training example (consider only the trees that do not include that example in their training subset, 60%)

Error Calculations:
$E_{OOB} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{Z_i} \sum_{j:\,(x_i, y_i) \notin S_j} loss(h_j(x_i), y_i)$
$Z_i = \sum_{j:\,(x_i, y_i) \notin S_j} 1$ (the number of trees whose training subset $S_j$ does not contain example i)

Neural Networks
• Weights are the parameters that encode the information in the brain
• The brain is superior because of its massive parallelism
• Weights act like the "storage" in computers
• When a human encounters a new experience, the weights of the brain get adjusted
• Memories are encoded in the weights
• The weights determine the functionality of the model
• A neuron can implement a linear classifier

Augmented Vectors:
$W = [W_0;\ w] = [W_0, W_1, W_2, \dots, W_N]^T$
$u(m) = [1;\ X(m)] = [1, X_1(m), \dots, X_N(m)]^T$
$y(m) = f\left(W^T u(m)\right)$ (2)
• It can be used for non-linearly separable problems as long as it produces a low classification error rate

Neural Network Steps (Linear Perceptron Algorithm):
1- Initialize the weights and threshold (bias) randomly
2- Present the augmented input (or feature) vector u(m) of the m-th training pattern and its corresponding desired output d(m)
3- Calculate the actual output for pattern m → $y(m) = f\left(W^T u(m)\right)$
4- Adapt the weights according to the following rule (called the Widrow-Hoff rule):
   $W(new) = W(old) + \eta\,[d(m) - y(m)]\,u(m)$
   $\eta$ → Learning rate
5- Go to step 2 until all patterns are classified correctly, i.e., d(m) = y(m) for m = 1, …, M

Least Square Classifier
• If the problem is not linearly separable, the perceptron algorithm will not converge and will keep cycling forever; that is why we need the least square classifier (Rosenblatt theorem)
• Define an error function
  $E = \sum_{m=1}^{M} \left(W^T u(m) - b_m\right)^2$
  It measures how close the obtained solution is to the desired one.
• We then seek to minimize the error function by finding the W that minimizes E
• Define the gradient vector
  $\frac{\partial E}{\partial W} = \left[\frac{\partial E}{\partial W_0}, \frac{\partial E}{\partial W_1}, \dots, \frac{\partial E}{\partial W_N}\right]^T$
  – Set $\frac{\partial E}{\partial W} = 0$ and solve for W
  – Advantages:
    * Can converge even if the problem is not linearly separable
  – Disadvantages:
    * Linear classifiers do not solve all problems

Multi-Layer Network
• The powerful feature of multilayer networks (aka feed-forward networks) is their ability to learn
• We usually use hidden node functions (activations) that are continuous

Gradient Descent
• We use the concept of steepest descent
• To update the weights: $W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$
  $\alpha$ → The learning rate

Back Propagation Algorithm
• It is an algorithm based on the steepest descent concept
• Used to train multilayer networks
Steps:
1- Initialize the weights and threshold (bias) randomly in [-1, 1]
2- Present the augmented input (or feature) vector u(m) of the m-th training pattern and its corresponding desired output d(m)
3- For m = 1 to M:
  • Present u(m) to the network and compute the hidden layer outputs and the final layer outputs
  • Use these outputs in a backward scheme to compute the partial derivatives of the error function w.r.t. the weights of each layer
  • Update the weights → $W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - \alpha \frac{\partial E_m}{\partial W_{i,j}^{[L]}}$
4- Compute the total error and stop in case of convergence

Disadvantages of Back Propagation
• Too small $\alpha$ → Very small steps, so the minimum is reached slowly
• Too large $\alpha$ → Leads to oscillations and possibly no convergence at all
• Use a variable $\alpha$ (start large, then decrease it) → A good range is between 0.001 and 0.05
• Prone to getting stuck in local minima
  – To avoid this problem, repeat the training several times, each time with a different set of initial weights

Types of Weight Updates → Check the answers below

Other Optimization Algorithms
• Gradient descent with momentum
  – Smooths out the steps of gradient descent using a moving average of the derivatives, so it avoids oscillations
  $D_W = \beta D_W + (1-\beta) \frac{\partial E}{\partial W}$, bias-corrected by dividing by $1-\beta^{t}$ (t is the iteration number)
  $W = W - \alpha D_W$
  $D_W = 0$ initially
• RMSProp
  $S_W = \beta S_W + (1-\beta) \left(\frac{\partial E}{\partial W}\right)^2$, bias-corrected by dividing by $1-\beta^{t}$
  $W = W - \alpha \frac{\partial E / \partial W}{\sqrt{S_W} + \epsilon}$
• Adam
  $S_W = \beta S_W + (1-\beta) \left(\frac{\partial E}{\partial W}\right)^2$, bias-corrected by dividing by $1-\beta^{t}$
  $D_W = \beta D_W + (1-\beta) \frac{\partial E}{\partial W}$, bias-corrected by dividing by $1-\beta^{t}$
  $W = W - \alpha \frac{D_W}{\sqrt{S_W} + \epsilon}$

Regularization
• Used to prevent overfitting
• Intuition: push the weights of some hidden nodes towards zero to simplify the network
• L2 regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_2^2$
• L1 regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_1$
• L2 regularization is used more often
• $\lambda$ is the regularization parameter (a hyperparameter)

Dropout Regularization
• Every epoch → shut down a random number of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler model
• More generalized

Input & Output Normalization
• Inputs have to be approximately in the range 0 to 1 or -1 to 1

Machine Learning Recipe
                 Underfitting   Normal        Overfitting
Training Set     Large Error    Small Error   Small Error
Validation Set   Large Error    Small Error   Large Error

Hyper-parameters Tuning
• 1st
  – Learning rate
• 2nd
  – Momentum parameter → $\beta = 0.9$
  – Number of hidden nodes
  – Mini-batch size
• 3rd
  – Number of hidden layers
  – Learning rate decay
• Adam parameters

Tuning Process
• Try random values: don't use a grid
• Coarse-to-fine scheme (focus on the good regions)
• Use an appropriate scale
  – Don't sample uniformly
  – Use a logarithmic scale

Machine Learning Model Selection
1- Categorize the problem
  • Input → Supervised? Unsupervised?
  • Output → Numerical? Regression?
2- Understand your data
  • Analyze the data
  • Process the data
  • Feature engineering
3- Determine the possible algorithms
4- Implement the machine learning algorithms
5- Tune the hyperparameters

Time Series Prediction
• Remove the seasonal periodicities
TS → Deseasonalize → Predict → Restore the seasonal components
$a(year) = \frac{1}{12} \sum_{t \in window} x(t)$
$Z(i) = \frac{x(i)}{a(year)}$
$u(i) = \frac{\sum_j Z_j(i)}{\#years}$
$x_{deseasonal}(t) = \frac{x(t)}{u(month(t))}$

What would have changed in the previous question if the neural network was solving a regression problem rather than a classification problem?
The difference is in the loss function of the model.
In the case of a classification problem →
$L = -\left[y \log a^{[2]} + (1-y) \log\left(1 - a^{[2]}\right)\right]$ (Cross Entropy)
In the case of a regression problem →
$L = \frac{1}{M} \sum_{i=1}^{M} \left(a_i^{[2]} - y_i\right)^2$ (MSE)
$L = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2}$ (RMSE)
One more modified equation (for MSE):
$\frac{\partial L}{\partial a^{[2]}} = \frac{2}{M} \sum_{i=1}^{M} \left(a_i^{[2]} - y_i\right)$

What would have changed in the previous question if the neural network was solving a multi-class classification problem rather than a binary classification problem? Write down the modified equations.
• The output layer will consist of K nodes, where K is the number of classes
• The sigmoid activation function in the final output layer has to be replaced by the softmax activation function, which produces K probabilities that sum to 1
• The derivative of the output activation has to be replaced by the derivative of softmax
$y_i = \frac{e^{z_i^{[L]}}}{\sum_j e^{z_j^{[L]}}}$

What does the learning rate (alpha parameter) mean? How does changing the learning rate affect the training process?
• It controls how much of the apportioned error is applied to the weights of the model each time they are updated
• Too small alpha → Convergence will be slow
• Too large alpha → The weights will oscillate around the minimum
• A good range is between 0.001 and 0.05
$W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$

What is the difference between batch gradient descent and sequential (stochastic) gradient descent?

Sequential (Stochastic) Gradient Descent (SGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training loop (epochs)
  for j = 1: M (M: # examples):
    1. Forward propagation
    2. Compute the loss function
    3. Backward propagation
    4. Update the weights and biases
Advantages:
• Faster updates compared to batch gradient descent
Disadvantages:
• Hard to converge, since each update depends on a single example
• Loses the speedup from vectorization

Batch Gradient Descent (BGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training loop (epochs)
  for j = 1: M (M: # examples):
    1. Forward propagation
    2. Accumulate to the loss
    3. Backward propagation (with averaging over M)
  4. Update the weights and biases (once per epoch)
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (each iteration takes too long)

Mini-Batch Gradient Descent (MBGD)
Initialize weights and biases randomly.
for i = 1: N iterations → Training loop (epochs)
  for j = 1: minibatches
    for k = 1: M (M: # examples in the j-th minibatch):
      1. Forward propagation
      2. Accumulate to the loss
      3. Backward propagation (with averaging over M)
    4. Update the weights and biases (once per minibatch)
Advantages:
• Fast

Mention four different types of activation functions. Write down the mathematical expression for each of them as well as their derivatives. Mention the advantage(s) and disadvantage(s) of each of them.
Sigmoid → $f(x) = \frac{1}{1 + e^{-x}}$, with derivative $f'(x) = f(x)\,(1 - f(x))$
- Slow learning
- Used only in the output layer of binary classification problems
- Unfortunately its gradient is close to 0 in some parts (for large |x|)

Tanh → $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, with derivative $f'(x) = 1 - f(x)^2$
- Comes next after ReLU in popularity
- More common in sequence problems (speech recognition)
- Unfortunately its gradient is close to 0 in some parts (for large |x|)

ReLU → $f(x) = \max(x, 0)$, with derivative 1 for x > 0 and 0 for x < 0
Leaky ReLU → $f(x) = x$ for x > 0 and $\alpha x$ otherwise, with derivative 1 for x > 0 and $\alpha$ otherwise
- Most popular now
- Faster learning
- Leaky ReLU is better than ReLU because it has no zero gradient in the negative part

What are the hyperparameters in the gradient descent update algorithm? How to select these hyperparameters?
• Learning rate $\alpha$
• Size of the mini-batch
• Number of hidden nodes
• Number of layers
Tuning process:
1- Try random values
2- Coarse-to-fine scheme
3- Use an appropriate scale (don't sample uniformly, use a logarithmic scale)

What is the difference between training set, test set and validation set? Are there any guidelines in selecting each of them?
Training set: the data used to extract features and fit the model.
Validation set: crucial for choosing the right parameters for your estimator. The training data can be divided into a train set and a validation set. Based on the validation results, the model can be adjusted (for instance, by changing parameters or classifiers). This helps us obtain the most optimized model.
Testing set: once the model has been obtained from the training set, it is used to make predictions on the testing set, which measures its final performance.
Best practice for selecting the sets:
• Training → 60%
• Validation → 20%
• Testing → 20%
In the case of big data:
• Training → 98%
• Validation → 1%
• Testing → 1%

Mention the difference between overfitting and underfitting? Give an example to each of them.
                 Underfitting   Normal        Overfitting
Training Set     Large Error    Small Error   Small Error
Validation Set   Large Error    Small Error   Large Error

What are different optimization algorithms? State the weight update equation for each of these optimizers.
• Gradient descent without momentum
  $W(new) = W(old) - \alpha \frac{\partial E}{\partial W}$
• Gradient descent with momentum
  – Smooths out the steps of gradient descent using a moving average of the derivatives, so it avoids oscillations
  $D_W = \beta D_W + (1-\beta) \frac{\partial E}{\partial W}$, bias-corrected by dividing by $1-\beta^{t}$ (t is the iteration number)
  $W = W - \alpha D_W$
  $D_W = 0$ initially
• RMSProp
  - Slows down learning in unintended directions
  - Avoids oscillations
  $S_{W_i} = \beta S_{W_i} + (1-\beta) \left(\frac{\partial E}{\partial W_i}\right)^2$, $S_{W_j} = \beta S_{W_j} + (1-\beta) \left(\frac{\partial E}{\partial W_j}\right)^2$, each bias-corrected by dividing by $1-\beta^{t}$
  $W_i = W_i - \alpha \frac{\partial E / \partial W_i}{\sqrt{S_{W_i}} + \epsilon}$
  $W_j = W_j - \alpha \frac{\partial E / \partial W_j}{\sqrt{S_{W_j}} + \epsilon}$
• Adam (combines both RMSProp & momentum)
  $S_{W_i} = \beta S_{W_i} + (1-\beta) \left(\frac{\partial E}{\partial W_i}\right)^2$, bias-corrected by dividing by $1-\beta^{t}$
  $D_{W_i} = \beta D_{W_i} + (1-\beta) \frac{\partial E}{\partial W_i}$, bias-corrected by dividing by $1-\beta^{t}$
  $W_i = W_i - \alpha \frac{D_{W_i}}{\sqrt{S_{W_i}} + \epsilon}$

What is the main difference between gradient descent and gradient descent with momentum?
With stochastic gradient descent we do not compute the exact derivative of our loss function. Instead, we estimate it on a small batch, which means we are not always going in the optimal direction, because our derivatives are "noisy". Gradient descent with momentum smooths out the steps of gradient descent using a moving average of the derivatives, which takes the previous gradients into consideration, so it avoids oscillations and learns faster.

Mention two different ways to reduce overfitting. Explain how each of them reduces overfitting.

Regularization
• Used to prevent overfitting
• Intuition: push the weights of some hidden nodes towards zero to simplify the network
• L2 regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_2^2$
• L1 regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_1$
• L2 regularization is used more often
• $\lambda$ is the regularization parameter (a hyperparameter)

Dropout Regularization
• Every epoch → shut down a random number of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler model
• More generalized

MCQ Questions:
1. Which hyperparameter of the following needs to be tuned first in a typical neural network problem?
   a. Momentum parameter.
   b. Mini-batch size.
   c. Learning rate.
   d. Number of hidden nodes in each layer.
2. The softmax layer is only used in
   a. Binary classification problems.
   b. Multiclass classification problems.
   c. Regression problems.
   d. All of the above.
3. As the number of hidden layers increases in a neural network,
   a. The time needed to train the network decreases.
   b. The network learns more complex functions and features.
   c. The network converges to a local minimum of the cost function faster.
   d. The network may be subject to overfitting.
4. The activation function recommended to use when working with images is
   a. ReLU function.
   b. Step function.
   c. Sigmoid function. → Final layer of binary classification problems
   d. Tanh function. → For speech
5. The neural network that tries to match two given inputs and detect how similar or different they are from each other is called
   a. Convolutional neural network.
   b. Siamese network.
   c. Recurrent neural network.
   d. Generative Adversarial Network.
6. A network with a skip connection from output layer to input layer is called
   a. Convolutional neural network.
   b. Siamese network.
   c. Recurrent Neural Network.
   d. Generative Adversarial Network.
7. A neural network used mainly to generate features from input images and represent these images in a compressed low dimensional space is called:
   a. Convolutional neural network.
   b. Auto Encoder network.
   c. Recurrent Neural Network.
   d. Siamese Network.
8. If the neural network is subject to overfitting, then we can reduce the effect of overfitting by:
   a. Increasing the size of the training data.
   b. Increasing the size of the neural network.
   c. L2 regularization.
   d. Dropout regularization.
9. A time series is composed of:
   a. Trend
   b. Seasonality
   c. Random noise
   d. All of the above
10. For the neural network to learn functions such as XOR and XNOR, it is sufficient to have:
   a. 1 input layer and 1 output layer.
   b. 1 input layer, 1 hidden layer, 1 output layer.
   c. 1 input layer, 2 hidden layers, 1 output layer.
   d. It is dependent on the number of inputs, and so it is impossible to tell.
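As a side note to question 10, the sketch below shows that, for two binary inputs, a network with a single hidden layer of two step neurons can represent XOR and XNOR. The weights and thresholds are hand-picked for the illustration (they are assumptions, not learned values), and the function names `step`, `xor_net`, and `xnor_net` are hypothetical.

```python
def step(z):
    """Hard-threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    # Hidden layer: two neurons with hand-picked weights and biases.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1 (OR)
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1 (AND)
    # Output neuron: "OR but not AND".
    return step(h_or - h_and - 0.5)


def xnor_net(x1, x2):
    # Same hidden layer, output weights and bias negated relative to xor_net.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_and - h_or + 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b), xnor_net(a, b))
```

A single neuron without a hidden layer cannot do this, because XOR is not linearly separable.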
QUESTIONS

Define the Machine Learning Recipe
                 Underfitting   Normal        Overfitting
Training Set     Large Error    Small Error   Small Error
Validation Set   Large Error    Small Error   Large Error

What's the key idea of Adaboost algorithm? Explain using a diagram
It tries to approximate the Bayes classifier by combining many of the best weak classifiers to create a strong classifier
Weak classifier: able to guess the right class with a percentage slightly better than random guessing
Strong classifier: usually correct more than 80% of the time

Why can't adaboost work for more than binary classification? How would you modify it to work for more?
AdaBoost assumes that the error of each weak classifier is less than 0.5
∴ $\alpha_t$ is negative if the error is greater than 0.5
With K classes, the random guessing error rate is $\frac{K-1}{K}$, so a weak classifier can be better than random guessing and still have an error above 0.5.
To overcome this problem, the equation for $\alpha_t$ is updated to:
$\alpha_t = \log\frac{1 - err}{err} + \log(K - 1)$
∴ $\alpha_t$ is positive only if $(1 - err) > \frac{1}{K}$

What's the difference between wrapper method and filter method for feature selection?
Methods of Feature Selection:
• Filter type: select features without looking into the classifier you are going to use.
– Advantages → Faster execution & generality
– Disadvantages → Tendency to select large subsets
• Wrapper type: select features by taking into consideration the classifier you will use (SFS & BFS)
– Advantages → Accuracy
– Disadvantages → Slow execution & lack of generality

Compare between AI, ML, & DL
• AI includes any technique that mimics human behavior
• Classical AI (Excluding ML and DL)
– Rule based expert systems
– Knowledge reasoning
– No learning needed (No data is available)
– Simpler applications
• Machine Learning
– Programs learn from examples
– Improve performance based on experience
– Finding patterns in data
– Adapt to complex and changing environments
• Deep Learning
– Multi-layered neural networks
– End-to-end solution (works on raw data)
– Requires big data and high computational power

If you train a neural network and get 54% training accuracy and 51% validation accuracy, explain what you will do next
This means the neural network is underfitted, so to overcome this problem:
• Use a larger network (more nodes or more layers)
• Train for a longer time

Explain with examples the linear perceptron update rule
Widrow-Hoff Rule
$W(new) = W(old) + \eta [d(m) - y(m)] u(m)$
$\eta$ → Learning rate
If $d(m) = y(m)$, then no change is needed in the weights.
To show that each iteration corrects errors:
– Let the actual class be $d(m) = 1$
– If the neuron classification is $y(m) = f(W^T u(m)) = 0$
– So, $W^T u(m) < 0$
– However, we want $y(m) = 1$
– We need to correct the wrong classification by making what is inside $f(\cdot)$ more positive, which will make $y(m)$ more likely to be 1
$W(new) = W(old) + \eta [d(m) - y(m)] u(m)$
$y(new) = f(W_{new}^T u(m))$
$y(new) = f(W_{old}^T u(m) + \eta [d(m) - y(m)] u^T(m) u(m))$
$y(new) = f(W_{old}^T u(m) + \eta [1 - 0] u^T(m) u(m))$
$y(new) = f(W_{old}^T u(m) + \eta \|u(m)\|^2)$
Since $\eta \|u(m)\|^2 > 0$, the argument of $f(\cdot)$ becomes more positive after the update.

Explain the naive estimator method, write the formula used and compare it to histogram analysis
Naive Estimator
$P(X) = \frac{\#\text{points falling in } [X - \frac{h}{2}, X + \frac{h}{2}]}{M h}$
• Discontinuity of the density estimates
• All data points are weighted equally regardless of their distance to the estimation point
Histogram Analysis
$p(x) = \frac{m}{M \cdot \text{sizeOfBin}}$
m → number of data points falling in the bin
M → total number of points that belong to the same class
• Weak method of estimation
• Discontinuity of these density estimates, even though the true densities are assumed to be smooth

Explain the kernel density estimation technique with multidimensional case equations.
To apply the kernel density estimation technique, we need the following:
1. Bump Function (g(x))
• We can model the bump function as follows in the multidimensional case
2. Choosing Optimal h (diagonal bandwidth matrix)
• We get $H_i$ in each dimension with the normal reference rule:
Then we get the average and use it as our $h_{opt}$

Explain why ReLU is used instead of sigmoid
• The constant gradient of ReLUs results in faster learning than sigmoid
• The gradient is less likely to vanish

Explain the least square classifier. No need for proof
Least Square Classifier
• If the problem is not linearly separable, then the perceptron algorithm will not converge and will keep cycling forever (Rosenblatt theorem); that is why we need the least square classifier
• Define an error function
$E = \sum_{m=1}^{M} (W^T u(m) - b_m)^2$
It measures how close the obtained solution is to the desired one.
• We then seek to minimize the error function by finding the W that minimizes E
• Define the gradient vector
– $\frac{\partial E}{\partial W} = [\frac{\partial E}{\partial W_0}, \frac{\partial E}{\partial W_1}, \dots, \frac{\partial E}{\partial W_N}]^T$
– Set $\frac{\partial E}{\partial W} = 0$ and solve for W
– Advantages:
* Can converge even if the problem is not linearly separable
– Disadvantages:
* Linear classifiers don't solve all problems

What are the main issues in GMM?
1. Initialization
• Expectation–Maximization (EM) is an iterative algorithm which is very sensitive to initial conditions.
(Start from trash → end up with trash)
• Usually we use K-Means to get a good initialization.
2. Determining Number of Gaussian Components
• Use information-theoretic criteria to obtain the optimal K
• Methods:
o MDL (Minimum Description Length)
o AIC (Akaike’s Information Criterion)
o BIC (Bayesian Information Criterion)
o MML (Minimum Message Length)

Types of features to be removed in feature selection method
Types of Features to be Removed
- Irrelevant Features
- Correlated Features:
– Features that vary very closely with other features (Ex: #corners & #lines)

Explain regularization and why it is used
Regularization is used to avoid overfitting
Regularization
• Used to prevent overfitting
• Intuition: drive the weights of some hidden nodes toward zero to simplify the network
• L2 Regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_2^2$
• L1 Regularization → $\frac{1}{M} \sum_{m=1}^{M} E_m + \frac{\lambda}{2M} \|W\|_1$
• L2 regularization is used more often
• $\lambda$ is the regularization parameter (hyperparameter)
Dropout Regularization
• Every epoch → shut down a random subset of neurons
• Not all nodes get trained every epoch
• No neuron gets overfitted
• Simpler model
• More generalized
Input & Output Regularization
• Inputs have to be approximately in the range of 0 to 1 or -1 to 1

Explain how Bayes rule can be used in classification
Given a pattern X (with unknown class) that we wish to classify
• Compute $P(C_1|X), P(C_2|X), \dots, P(C_K|X)$
• Find the k giving maximum $P(C_k|X)$
• $P(C_k|X)$ → posterior probability
• $P(C_k)$ → a priori probability
• $P(X|C_k)$ → class conditional density
$P(C_k|X) = \frac{P(X|C_k) P(C_k)}{P(X)}$
This is our classification according to the Bayes classification rule.
It is an optimal classification rule because it chooses the most likely class, so nothing could be better.

Explain the importance of deseasonalization using diagrams.
It removes the seasonal periodicities so that trends can be predicted correctly.
After trend prediction, seasonality can be recovered by multiplying by the corresponding seasonal average.

Explain steps of back propagation, and state its disadvantages.
Steps:
1- Initialize the weights and threshold (bias) randomly in [-1, 1]
2- Present the augmented input (or feature) vector of the m-th training example $u(m)$ and its corresponding desired output $d(m)$
3- For m = 1 to M:
• Present $u(m)$ to the network and compute the hidden layer outputs and final layer outputs
• Use these outputs in a backward scheme to compute the partial derivatives of the error function w.r.t. the weights of each layer
• Update the weights → $W_{i,j}^{[L]}(new) = W_{i,j}^{[L]}(old) - \alpha \frac{\partial E_m}{\partial W_{i,j}^{[L]}}$
4- Compute the total error and stop in case of convergence
Disadvantages of Back Propagation
• Can often be slow in reaching the minimum
– Too small $\alpha$ → Very small steps, so the minimum is reached slowly
– Too large $\alpha$ → Leads to oscillations and possibly no convergence at all
• Prone to getting stuck in local minima
To overcome the disadvantages:
• Use a variable $\alpha$ (start large, then decrease it) → A good range is between 0.001 and 0.05
• To avoid getting stuck in local minima, repeat the training several times, each time with a different set of initial weights

Explain the three methods used to update weights.
Sequential (Stochastic) Gradient Descent (SGD)
Initialize weights and biases randomly.
for i = 1 : N iterations → Training loop (epochs)
    for j = 1 : M (M: # examples):
        1. Forward propagation.
        2. Compute the loss function.
        3. Backward propagation.
        4. Update the weights and biases.
Advantages:
• Updates faster than batch gradient descent
Disadvantages:
• Hard to converge since it depends on every single example
• Loses the speedup from vectorization
Batch Gradient Descent (BGD)
Initialize weights and biases randomly.
for i = 1 : N iterations → Training loop (epochs)
    for j = 1 : M (M: # examples):
        1. Forward propagation.
        2. Accumulate to the loss.
        3. Backward propagation.
        (with averaging over M)
    4. Update the weights and biases (once per epoch, after all M examples).
Advantages:
• Optimization is more consistent
Disadvantages:
• Slow (too long per iteration)
Mini-Batch Gradient Descent (MBGD)
Initialize weights and biases randomly.
for i = 1 : N iterations → Training loop (epochs)
    for j = 1 : number of minibatches
        for k = 1 : M (M: # examples in the jth minibatch):
            1. Forward propagation.
            2. Accumulate to the loss.
            3. Backward propagation (with averaging over M).
        4. Update the weights and biases (once per minibatch).
Advantages:
• Fast

Describe how CNNs work, and why they have less memory footprints.
They are mostly applied to imagery problems
- Layers extract features from input images
- Convolution layer, i.e., filtering
- Pooling layer, i.e., reduce input (avg or max)
- Fully connected layer, i.e., as in a multi-layer NN, at the final layers
Why less memory footprints?
- Parameter sharing (compared to fully connected layers)
- Sparsity of connections (a pixel in the next layer is not connected to all 100 pixels of the first layer)

Discuss 3 limitations of Deep Learning.
1. Not a magic tool!
• Lack of adaptability and generality compared to the human vision system
• Not able to build a general intelligent machine
2. Can’t fit all real-world scenarios
• Infinite variables
3. Large amounts of labeled data can lead to impressive achievements in supervised learning, but:
• Expensive!
• Sometimes experts & special equipment are needed
4. Datasets may be biased
• Deep networks become biased against rare patterns
• Serious consequences in some real-world applications (e.g., medical, automotive, … etc.)
• Classification may be sensitive to viewpoint if one of the viewpoints is underrepresented
• Solution: Researchers should consider synthetic generation of data to mitigate the unbalanced representation of data
5. Sensitive to standard adversarial attacks
• Datasets are finite and represent just a fraction of all possible images
• Solution: Add extra training, i.e., “adversarial training”
6. Over-sensitive to changes in context
• Limited number of contexts in the dataset, e.g., monkey in jungle
• Combinatorial explosion!
7. Combinatorial Explosion
• Real world images are combinatorially large
• Application dependent (e.g., medical imaging is an exception)
• Considering compositionality may be a potential solution
• Testing is challenging (consider worst case scenarios)
8. Visual understanding is tricky
• Mirrors
• Sparse information
• Physics
• Humor
9. Unintended results from fitness function

Explain the K-folds validation method. In what context is it used
Used for parameter tuning over the training dataset.
In this technique, the parameter K refers to the number of different subsets that the given data set is split into. K-1 subsets are used to train the model and the left-out subset is used as a validation set.
Steps involved in the K-fold cross-validation:
1. Split the data set into K subsets randomly
2. For each one of the resulting subsets of data points:
3. Treat that subset as the validation set
4. Use all the remaining subsets for training
5. Train the model and evaluate it on the held-out (validation) subset
6. Calculate the prediction error
7. Repeat the above steps K times, i.e., until the model has been trained and tested on all subsets
8. Generate the overall prediction error by taking the average of the prediction errors over all K runs
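
The K-fold steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the course's reference implementation: the helper names (`k_fold_indices`, `fit_mean_model`, `mean_squared_error`, `k_fold_cv`) are hypothetical, and the "model" is a toy that always predicts the mean of its training targets, chosen so the example needs nothing beyond the standard library.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Step 1: shuffle the sample indices and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index starting at position f.
    return [indices[f::k] for f in range(k)]


def fit_mean_model(train_targets):
    """Toy model (an assumption for this sketch): predict the training mean."""
    return mean(train_targets)


def mean_squared_error(prediction, targets):
    """Step 6: prediction error of a constant prediction on a set of targets."""
    return mean((t - prediction) ** 2 for t in targets)


def k_fold_cv(targets, k, seed=0):
    """Steps 2-8: return the per-fold errors and their average."""
    folds = k_fold_indices(len(targets), k, seed)
    errors = []
    for f, val_idx in enumerate(folds):
        # Steps 3-4: fold f is the validation set, the rest is training data.
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        # Step 5: train on K-1 folds, evaluate on the held-out fold.
        model = fit_mean_model([targets[i] for i in train_idx])
        errors.append(mean_squared_error(model, [targets[i] for i in val_idx]))
    # Step 8: the overall error is the average over the K runs.
    return errors, mean(errors)


targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
fold_errors, overall_error = k_fold_cv(targets, k=3)
print(fold_errors, overall_error)
```

For parameter tuning, the same loop would be run once per candidate parameter value, and the value with the lowest average error would be kept.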