Notation
- m – number of training examples
- x – input variable / feature
- y – output variable / target variable
- (x(i), y(i)) – the i-th training example
- a := a + 1 – assign a + 1 to a

Supervised Learning
- regression problem
- classification problem, ex. labeling spam email

Unsupervised Learning
- clustering problem
- non-clustering problem, ex. the cocktail party problem

Linear Regression
J – cost function / error function (a function that measures the error made by the current hypothesis).

Gradient descent algorithm – for minimizing J
- alpha – learning rate (how fast we step down the hill)
- batch gradient descent – gradient descent that looks at every training example in each step; because each step uses the whole batch of data, it is called batch gradient descent. (A short Octave sketch appears at the end of this section.)
- Running time: O(k n^2)

Linear regression with multiple features: h_θ(x) = θ^T x

Feature scaling
- makes gradient descent run faster
- puts the features on similar, appropriate scales
ex. x1: buying cost of a house (1,000,000 baht+), x2: time (1 year, 2 years, ...), y: selling cost of the house (600,000 baht+)

Mean normalization – replace xi with xi − μ to give it zero mean:
x1 ← (x1 − μ) / s ; s can be the standard deviation or (max − min)

Learning rate
How to choose:
- the lower alpha is → the more iterations needed → the better the chance of reaching the minimum of J(θ)
- test whether J(θ) has converged → automatic convergence test → declare convergence if J(θ) decreases by less than 10^-3 in one iteration
- try 0.001, 0.01, 0.1, 1, ...

Polynomial Regression
For example h_θ(x) = θ1 x1 + θ2 x1^2 + θ3 x1^3 – create feature x2 as x1^2 and x3 as x1^3.

Normal Equation
θ = (X^T X)^-1 X^T y → gives the θ that minimizes J(θ) in one step
Octave: pinv(X'*X)*X'*y
Running time: O(n^3)
If X^T X is non-invertible:
- redundant features (linearly dependent)
- too many features (m <= n) → delete some features or use regularization

Classification Algorithm: Logistic Regression
Sigmoid function / logistic function: g(z) = 1 / (1 + e^(-z)) ; z = θ^T x
The prediction is h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x)).
h_θ(x) gives the probability that the output is 1.
h_θ(x) >= 0.5 → g(θ^T x) >= 0.5 → θ^T x >= 0

Decision boundary – the line that separates the classes
Non-linear decision boundary – add polynomial features, as in polynomial regression

Cost function
cost(h_θ(x(i)), y(i)) = −log(h_θ(x)) if y = 1 ; −log(1 − h_θ(x)) if y = 0
Interpretation: cost = 0 if y = 1 and h_θ(x) = 1
Interpretation: cost → infinity if y = 0 and h_θ(x) = 1

Simplified cost function
cost(h_θ(x(i)), y(i)) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
J(θ) = (1/m) Σ_{i=1..m} [ −y(i) log(h_θ(x(i))) − (1 − y(i)) log(1 − h_θ(x(i))) ]

Applying gradient descent → where h_θ(x(i)) = 1 / (1 + e^(-θ^T x(i)))

Advanced optimization algorithms
- Gradient descent
- Conjugate gradient \
- BFGS               > much faster than gradient descent
- L-BFGS             /

Create my own function to compute the cost and gradient for a set of theta:

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Built-in function in Octave:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% options holds the settings: 'GradObj','on' → we provide the gradient; 'MaxIter',100 → at most 100 iterations
% fminunc: function minimization unconstrained; returns the optimal theta. If theta is a single scalar, fminunc cannot be used.

Multiclass classification
One vs all: for each class, treat all the other classes as one combined class. This gives one classifier h_θ^(i)(x) per class. Train each classifier on the m training examples. To make a prediction, run all classifiers and pick the class whose classifier gives the highest probability. (See the one-vs-all sketch at the end of this section.)
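A minimal Octave sketch (not from the original notes) tying together mean normalization, batch gradient descent, and the normal equation described above. The data, variable names (X, y, alpha, numIters), and numbers are illustrative; X is assumed to be an m x n matrix of raw feature values, loosely mimicking the house example.

% illustrative data: m = 50 houses, 2 raw features on very different scales
X = [1e6 + 5e5 * rand(50, 1), 1 + 9 * rand(50, 1)];
y = 6e5 + 0.4 * X(:, 1) - 1e4 * X(:, 2) + 1e4 * randn(50, 1);

% mean normalization: x_j <- (x_j - mu_j) / s_j, with s_j the standard deviation
mu = mean(X);
s  = std(X);
Xnorm = (X - mu) ./ s;               % Octave broadcasts the row vectors mu and s
Xb = [ones(rows(Xnorm), 1) Xnorm];   % add the x0 = 1 column

m = length(y);
theta = zeros(columns(Xb), 1);
alpha = 0.1;                         % learning rate: how fast we step down the hill
numIters = 400;

for iter = 1:numIters
  h = Xb * theta;                                  % predictions h_theta(x) = theta' * x
  J = (1 / (2 * m)) * sum((h - y) .^ 2);           % cost J(theta) before this step
  theta = theta - (alpha / m) * (Xb' * (h - y));   % batch update: uses all m examples
end

thetaNE = pinv(Xb' * Xb) * Xb' * y;  % normal equation from the notes, for comparison

With the features normalized, alpha = 0.1 converges quickly; without normalization the same alpha would diverge, which is the point of feature scaling.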
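And a sketch of the one-vs-all scheme just described, reusing the costFunction / fminunc pattern from these notes. The data, lrCostFunction, allTheta, and K are illustrative assumptions; X is assumed to already contain the x0 = 1 column and y to hold class labels 1..K.

% save as lrCostFunction.m (or define it before running the script part below)
function [jVal, gradient] = lrCostFunction(theta, X, yBin)
  m = length(yBin);
  h = 1 ./ (1 + exp(-X * theta));              % sigmoid of theta' * x
  jVal = (1 / m) * sum(-yBin .* log(h) - (1 - yBin) .* log(1 - h));
  gradient = (1 / m) * (X' * (h - yBin));      % gradient of J(theta)
end

% illustrative data: K = 3 classes in 2D, with the x0 = 1 column already added
m = 150;
raw = [randn(50, 2); randn(50, 2) + 3; randn(50, 2) - 3];
X = [ones(m, 1) raw];
y = [ones(50, 1); 2 * ones(50, 1); 3 * ones(50, 1)];

% one classifier per class: class c is treated as "1", every other class as "0"
K = max(y);
allTheta = zeros(K, columns(X));
options = optimset('GradObj', 'on', 'MaxIter', 100);
for c = 1:K
  theta0 = zeros(columns(X), 1);
  allTheta(c, :) = fminunc(@(t) lrCostFunction(t, X, y == c), theta0, options)';
end

% prediction: the class whose classifier gives the highest probability wins
probs = 1 ./ (1 + exp(-X * allTheta'));        % m x K matrix of probabilities
[~, prediction] = max(probs, [], 2);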
Regularization
Overfitting problem
- underfit model: high bias
- overfit model: high variance; too many features, so it fails to generalize to new examples
Options to solve:
1. Reduce the number of features: manually or with a model selection algorithm.
2. Regularization: keep all the features but reduce the magnitude of the theta values.

Regularized linear regression cost function:
J(θ) = (1/2m) [ Σ_{i=1..m} (h_θ(x(i)) − y(i))^2 + λ Σ_{j=1..n} θj^2 ]
The λ in the last term is called the regularization parameter. It raises the cost of large θ values, so the optimization prefers θ values that keep the parameters small. The higher λ is, the less the model overfits (too high and it underfits), so λ must be chosen carefully.

Regularized Linear Regression
Gradient Descent
We modify the gradient descent update to separate θ0 from the rest of the parameters, because we do not want to penalize θ0:
θ0 := θ0 − α (1/m) Σ (h_θ(x(i)) − y(i)) x0(i)
θj := θj (1 − α λ / m) − α (1/m) Σ (h_θ(x(i)) − y(i)) xj(i) , for j = 1..n
Note: the factor (1 − α λ / m) is always less than 1, so each update shrinks θj slightly.
Normal equation
θ = (X^T X + λ L)^-1 X^T y, where L is the (n+1) x (n+1) identity matrix with the top-left entry set to 0. It can be proven that the matrix inside the inverse is always invertible.

Regularized Logistic Regression
Gradient Descent
Same update as regularized linear regression, but with the logistic hypothesis h_θ(x) = 1 / (1 + e^(-θ^T x)).

Neural Network
Terminology
- a_i^(j) = activation of unit i in layer j
- Θ(j) = matrix of weights controlling the function mapping from layer j to layer j+1
- the activation function g is the sigmoid function
- if the network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ(j) has dimension s_{j+1} x (s_j + 1). The +1 comes from the bias node.

Forward propagation
In short, a_i^(j) = g(z_i^(j)). Θ_ij is the weight that multiplies a_j in the current layer to feed unit a_i in the next layer. (See the forward propagation sketch at the end of this section.)
Example
Let a(1) = x ∈ R^(n+1), with the bias entry a0(1) = 1.
a(2) is computed by z(2) = Θ(1) a(1) ; a(2) = g(z(2)).
With a suitable Θ, a single unit like this can represent the logical OR function.

Multiclass classification
Use the one-vs-all method. h_Θ(x) looks like [0; 1; 0; 0].
The training set: (x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m))
y(i) looks like [1; 0; 0; 0], [0; 1; 0; 0], [0; 0; 1; 0], ...

Cost function
- L = total number of layers in the network
- s_l = number of units in layer l (not counting the bias unit)

Backpropagation Algorithm
g' is the derivative of the sigmoid function: g'(z) = g(z) (1 − g(z)).
z^(l) is the product of Θ^(l−1) and a^(l−1): z^(l) = Θ^(l−1) a^(l−1).
The term "backpropagation" comes from computing the error terms from the output layer back toward the input layer.
The triangle symbol Δ_ij^(l) accumulates the gradient contributions; D_ij^(l) is the resulting gradient ∂J(Θ)/∂Θ_ij^(l).
PS. For j ≠ 0 the regularization term is included: D_ij^(l) = (1/m)(Δ_ij^(l) + λ Θ_ij^(l)); for j = 0, D_ij^(l) = (1/m) Δ_ij^(l).

Unrolling parameters
Calculate gradientVec with function [jVal, gradientVec] = costFunction(thetaVec) ; DVec is gradientVec.

Gradient checking
- checks whether the forward prop / back prop gradient is correct
- only for making sure DVec is working fine
- we do not use gradApprox (the numerical gradient computation) instead of DVec because it is much slower
(See the gradient checking sketch at the end of this section.)

Random Initialization (initial theta)
Description: if all weights start at the same value, then on every backprop step all nodes in a layer update to the same value, so it is as if a1, a2, ... were the same feature. That is redundant. Solved by random initialization.

Training a Neural Network
1. Randomly initialize the weights
2. Implement forward propagation to get h_Θ(x(i))
3. Implement the cost function
4. Implement backpropagation to compute the partial derivatives
5. Use gradient checking to confirm backprop, then disable it
6. Use gradient descent or an advanced optimization method to minimize J(Θ)

To improve a learning algorithm
- Get more training examples → fixes high variance
- Try a smaller set of features → fixes high variance
- Try getting additional features → fixes high bias
- Try adding polynomial features → fixes high bias
- Try decreasing lambda → fixes high bias
- Try increasing lambda → fixes high variance

Evaluating a hypothesis in a learning algorithm
1. Split the data into a training set and a test set
2. Learn Θ by minimizing Jtrain(Θ) on the training set
3. Compute the test set error Jtest(Θ)
New idea: split the data set into a training set, a cross validation set, and a test set. (A split-and-evaluate sketch appears at the end of this section.)
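A small sketch of forward propagation for one hidden layer, following the z(2) = Θ(1) a(1), a(2) = g(z(2)) steps above. The sizes, x, Theta1, and Theta2 are illustrative; randomly initialized weights stand in for learned ones.

% illustrative sizes and random weights (stand-ins for learned Theta matrices)
n = 3;  s2 = 4;  K = 4;
x = rand(n, 1);                    % one example, without the bias entry
Theta1 = rand(s2, n + 1) - 0.5;    % maps layer 1 -> layer 2, size s2 x (n + 1)
Theta2 = rand(K, s2 + 1) - 0.5;    % maps layer 2 -> layer 3, size K x (s2 + 1)

% sigmoid activation and its derivative g'(z) = g(z) .* (1 - g(z))
g  = @(z) 1 ./ (1 + exp(-z));
gp = @(z) g(z) .* (1 - g(z));

a1 = [1; x];           % layer 1 activations: input plus bias a0 = 1
z2 = Theta1 * a1;
a2 = [1; g(z2)];       % layer 2 activations plus bias
z3 = Theta2 * a2;
h  = g(z3);            % h_Theta(x): K outputs, one per class, e.g. like [0; 1; 0; 0]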
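The gradient checking idea from above can be sketched like this. The toy cost function and names (costFun, thetaVec, DVec) are illustrative; in the real case costFun returns J(Θ) for an unrolled thetaVec and DVec is the gradient produced by backpropagation.

% toy example: J(theta) = sum(theta .^ 2), whose exact gradient is 2 * theta
costFun  = @(t) sum(t .^ 2);
thetaVec = [1; -2; 3];
DVec     = 2 * thetaVec;           % stands in for the gradient from backpropagation

EPSILON = 1e-4;
gradApprox = zeros(size(thetaVec));
for i = 1:numel(thetaVec)
  thetaPlus  = thetaVec;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
  thetaMinus = thetaVec;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  % two-sided difference approximation of the i-th partial derivative
  gradApprox(i) = (costFun(thetaPlus) - costFun(thetaMinus)) / (2 * EPSILON);
end
% this relative difference should be tiny; then turn the check off, it is slow
disp(norm(gradApprox - DVec) / norm(gradApprox + DVec));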
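Finally, a sketch of splitting the data and computing the three errors from the last part of this section. The synthetic data, the 60/20/20 split, and the use of the normal equation as the "training" step are illustrative assumptions, not from the notes.

% illustrative data: a simple linear relationship with noise
m = 100;
X = [ones(m, 1) linspace(0, 10, m)'];
y = 3 + 2 * X(:, 2) + 0.5 * randn(m, 1);

idx = randperm(m);                               % shuffle before splitting
nTrain = round(0.6 * m);  nCv = round(0.2 * m);
trainIdx = idx(1:nTrain);
cvIdx    = idx(nTrain + 1 : nTrain + nCv);
testIdx  = idx(nTrain + nCv + 1 : end);

Xtrain = X(trainIdx, :);  ytrain = y(trainIdx);
Xcv    = X(cvIdx, :);     ycv    = y(cvIdx);
Xtest  = X(testIdx, :);   ytest  = y(testIdx);

computeCost = @(Xs, ys, th) (1 / (2 * rows(Xs))) * sum((Xs * th - ys) .^ 2);

theta  = pinv(Xtrain' * Xtrain) * Xtrain' * ytrain;   % minimize Jtrain(theta)
Jtrain = computeCost(Xtrain, ytrain, theta);          % training error
Jcv    = computeCost(Xcv, ycv, theta);                % used to pick degree d or lambda
Jtest  = computeCost(Xtest, ytest, theta);            % estimate of generalization error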
Cross validation data set – used to select a model.
We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with Jtest(Θ(d)), where Θ(d) is the theta from the degree with the lowest cross validation error.

Diagnose whether the model has high bias (underfit) or high variance (overfit) by looking at the errors as the degree of polynomial changes.

Select the best lambda for regularization
In order to choose the model and the regularization term λ, we need to:
1. Create a list of lambdas (i.e. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24});
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs, and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with λ) on JCV(Θ) without regularization (λ = 0).
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see if it generalizes well to the problem.

Learning curves

Error Analysis
The recommended approach to solving machine learning problems is to:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend in where most of the errors were made.

Precision/Recall
high precision, high recall = good algorithm
Precision/recall trade-off
- It is hard to get high precision and high recall at the same time.
- We vary the threshold on h_θ(x).
- high threshold → high precision, low recall
- low threshold → low precision, high recall

Compare algorithms (see the F1 sketch at the end of this section)
1) Evaluate F1 = 2PR / (P + R), so 0 <= F1 <= 1 ; P: precision, R: recall
2) Select the algorithm that gives the highest value of F1

Support Vector Machine
Cost function of the SVM:
min_θ C Σ_{i=1..m} [ y(i) cost1(θ^T x(i)) + (1 − y(i)) cost0(θ^T x(i)) ] + (1/2) Σ_{j=1..n} θj^2
Hypothesis of the SVM: outputs either 1 or 0 (1 when θ^T x >= 0).
The SVM can be called a large margin classifier: it chooses the separator that gives the largest margin (the black line in the course figure). The magenta line is what a very large C produces, because it reacts to the single outlier X at the bottom left of the figure; with a reasonable C, or if that outlier were not there, the black line is the result.

Large margin intuition: p(i) is the projection of x(i) onto θ, and θ^T x(i) = p(i) · ||θ||. In the bottom-left case, projecting the examples onto θ gives small p(i), so ||θ|| would have to be huge to satisfy the constraints. But our goal is to minimize ||θ||, so the SVM prefers the boundary with large projections, the green line in the right graph.

Kernels
Each landmark l(i) is a point in the same space as x; the feature f_i measures how similar x is to l(i), e.g. the Gaussian kernel f_i = exp(−||x − l(i)||^2 / (2σ^2)). (See the kernel feature sketch at the end of this section.)
Predict y by SVM: predict 1 when θ0 + θ1 f1 + θ2 f2 + ... >= 0
How to choose landmarks
- use all the training examples x(i) as landmarks
- since n is the number of features, we now have n = m
- when m is very large (100,000+), this becomes expensive; implementations modify the regularization term so the SVM runs more efficiently

SVM parameters: C and σ^2 (for the Gaussian kernel).
How to choose a kernel
- do feature scaling before running the SVM with the Gaussian kernel
Other kernels
- Polynomial kernel: k(x, l) = (x^T l + constant)^degree
- String kernel: input data type is string

Multi-class classification
- use the one-vs-all method
- train K SVMs, getting Θ(1), ..., Θ(K); predict the class i that gives the highest (Θ(i))^T x

What classifier should be used
- n large (relative to m): logistic regression or an SVM without a kernel (linear kernel)
- n small, m intermediate: an SVM with a Gaussian kernel
- n small, m large: create/add more features, then logistic regression or an SVM without a kernel
- a neural network works well in all of these cases, but is slower to train
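A sketch of building the Gaussian-kernel features f from the landmarks (all training examples), as described in the kernel notes above. The data, sigma, and gaussianSimilarity are illustrative names and values.

X = rand(20, 2);               % illustrative training examples (m = 20, n = 2)
sigma = 1.0;

% Gaussian kernel similarity between an example x and a landmark l
gaussianSimilarity = @(x, l, sigma) exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));

L = X;                         % landmarks: one per training example, so n becomes m
m = rows(X);
F = zeros(m, m);               % F(i, j) = f_j for example i = similarity(x(i), l(j))
for i = 1:m
  for j = 1:m
    F(i, j) = gaussianSimilarity(X(i, :), L(j, :), sigma);
  end
end
% with a learned theta, predict y = 1 when theta0 + theta' * f >= 0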
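And a small sketch of the precision / recall / F1 comparison step; pred and yActual are illustrative names for predicted and true 0/1 labels.

% illustrative predicted and true labels (0/1)
pred    = [1 1 0 1 0 0 1 0]';
yActual = [1 0 0 1 0 1 1 0]';

tp = sum((pred == 1) & (yActual == 1));   % true positives
fp = sum((pred == 1) & (yActual == 0));   % false positives
fn = sum((pred == 0) & (yActual == 1));   % false negatives

P  = tp / (tp + fp);                      % precision
R  = tp / (tp + fn);                      % recall
F1 = (2 * P * R) / (P + R)                % between 0 and 1; higher is better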