
Summary 1-7

Notation
m – number of training examples
x – input variable / feature
y – output variable / target variable
(x(i), y(i)) – the ith training example
a := a+1 – assign a+1 to a
Supervised Learning
- regression problem
- classification problem, ex. labeling spam email
Unsupervised Learning
- clustering problem
- non-clustering problem, ex. the cocktail party problem
Linear Regression
J – cost function / error function (a function that measures the error made by the current hypothesis)
Gradient descent algorithm
- used to minimize the cost function J(θ)
alpha – learning rate (how big a step we take downhill on each iteration)
batch gradient descent – every step of gradient descent uses all of the training examples; when gradient descent is applied this way it is called batch gradient descent.
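A minimal Octave sketch of one possible batch gradient descent loop for linear regression; the names X, y, theta, alpha and num_iters are illustrative and not from the notes.

% gradientDescent.m (illustrative)
% X: m x (n+1) design matrix (first column all ones), y: m x 1, theta: (n+1) x 1
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    % every step uses all m training examples -> "batch" gradient descent
    theta = theta - (alpha / m) * X' * (X * theta - y);
  end
end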
Running time: O(kn^2)
Linear regression with multiple features
hθ(x) = θᵀx = θ0·x0 + θ1·x1 + … + θn·xn   (with x0 = 1)
Feature scaling
- Make gradient descent run faster
- bring the features onto similar, appropriately sized ranges of values
ex. x1: buying cost of the house (1,000,000 baht+)
    x2: time (1 year, 2 years, ...)
    y: selling cost of the house (600,000 baht+)
Mean normalization
replace x_i with x_i − μ_i to give each feature zero mean
x1 ← (x1 − μ1)/s1 ; s1 can be the standard deviation or the range (max − min)
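A short Octave sketch of mean normalization with standard-deviation scaling; featureNormalize is an illustrative name, and element-wise broadcasting is assumed.

% featureNormalize.m (illustrative)
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                 % per-feature mean
  sigma = std(X);               % per-feature standard deviation (max - min would also work)
  X_norm = (X - mu) ./ sigma;   % zero mean, roughly unit scale (uses broadcasting)
end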
Learning Rate
How to choose
- a smaller alpha needs more iterations but is more likely to reach the minimum of J(θ)
- check that J(θ) converges by watching it over the iterations
  automatic convergence test: declare convergence if J(θ) decreases by less than 10^-3 in one iteration
- try values such as 0.001, 0.01, 0.1, 1, ...
Polynomial Regression
For example hθ(x) = θ1·x1 + θ2·x1^2 + θ3·x1^3
- create feature x2 as x1^2 and x3 as x1^3
Normal Equation
θ = (XᵀX)⁻¹Xᵀy
→ gives the θ that minimizes J(θ) in one step (no iterations)
Octave: pinv(X'*X)*X'*y
Running time: O(n^3)
If XᵀX is non-invertible:
- redundant features (linearly dependent features)
- too many features (m <= n)
→ delete some features or use regularization
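A small usage sketch of the normal equation in Octave; the raw features are assumed to sit in a matrix data_X and the targets in y, and x_new is a new feature column vector (all names illustrative).

m = size(data_X, 1);
X = [ones(m, 1) data_X];          % add the bias column x0 = 1
theta = pinv(X' * X) * X' * y;    % pinv still gives an answer even if X'X is non-invertible
prediction = [1 x_new'] * theta;  % hypothesis value for the new example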
Classification Algorithm
Logistic Regression
Sigmoid function / logistic function: g(z) = 1 / (1 + e^(−z)), with z = θᵀx
so the hypothesis is hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
hθ(x) gives us the probability that the output is 1
predict y = 1 when hθ(x) >= 0.5 ⇔ g(θᵀx) >= 0.5 ⇔ θᵀx >= 0
Decision Boundary – the line that separates the region predicted as y = 1 from the region predicted as y = 0
Non-linear Decision Boundary – add higher-order polynomial features, just as in polynomial regression
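A minimal Octave sketch of the sigmoid hypothesis and the 0.5-threshold prediction rule; the function and variable names are illustrative.

% sigmoid.m (illustrative)
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % works element-wise on vectors and matrices
end

% usage:
h = sigmoid(X * theta);     % probability that y = 1 for each example
pred = (h >= 0.5);          % predict 1 exactly when theta' * x >= 0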
Cost function
cost(hθ(x), y) = −log(hθ(x))      if y = 1
                 −log(1 − hθ(x))  if y = 0
Interpretation: Cost = 0 if y = 1 and hθ(x) = 1
Interpretation: Cost → infinity if y = 0 and hθ(x) → 1
Simplified cost function
cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x))
J(θ) = (1/m) Σ_{i=1..m} [ −y(i)·log(hθ(x(i))) − (1 − y(i))·log(1 − hθ(x(i))) ]
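The same J(θ) as one vectorized Octave expression; a sketch assuming X is m x (n+1) with a ones column, y is an m x 1 vector of 0/1 labels, and sigmoid is defined as above.

h = sigmoid(X * theta);
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));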
Applying Gradient Descent
θj := θj − (α/m) Σ_{i=1..m} ( hθ(x(i)) − y(i) )·xj(i)   (update all θj simultaneously)
where hθ(x(i)) = 1 / (1 + e^(−θᵀx(i)))
Advanced Optimization algorithm
- Gradient Descent
- Conjugate gradient
- BFGS
- L-BFGS
(the last three are much faster than Gradient Descent)
Write our own function that computes the cost J(θ) and its gradient for a given set of theta:
function [jVal, gradient] = costFunction(theta)
jVal = [...code to compute J(theta)...];
gradient = [...code to compute derivative of J(theta)...];
end
Built in function in Octave
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
% options holds the settings: 'GradObj','on' → we provide the gradient ourselves, 'MaxIter',100 → at most 100 iterations
% fminunc (function minimization unconstrained) returns the optimal theta; note that fminunc requires theta to be a vector with at least two elements, so it cannot be used when theta is a single real number
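A hedged, concrete instance of the costFunction skeleton above for unregularized logistic regression; here X and y are passed in through an anonymous function handle (a common alternative to the one-argument form), and all names are illustrative.

% costFunction.m (illustrative)
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % J(theta)
  gradient = (1 / m) * X' * (h - y);                        % dJ/dtheta
end

% then, at the Octave prompt or in a script:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);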
Multiclass classification
One vs all: for each class i, treat all of the other classes as one negative class and train a classifier hθ(i)(x) on the m training examples. We end up with n classifiers, one per class. To make a prediction, run every classifier on the new input and pick the class whose classifier gives the highest probability.
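A sketch of the one-vs-all prediction step in Octave, assuming all_theta stacks the K learned parameter vectors as rows (a K x (n+1) matrix); the names are illustrative.

probs = sigmoid(X * all_theta');   % m x K matrix: column i = probability of class i
[~, p] = max(probs, [], 2);        % p(j) = class whose classifier gives the highest probability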
Regularization
Overfitting problem
underfit model: high bias
overfit model: high variance
overfitting: too many features let the hypothesis fit the training set very well but fail to generalize to new examples
Options to solve:
1. Reduce the number of features: manually or model selection algorithm.
2. Regularization: keep all the features but reduce the magnitude of the parameters θj.
Regularized linear regression cost function:
J(θ) = (1/2m) [ Σ_{i=1..m} ( hθ(x(i)) − y(i) )^2 + λ·Σ_{j=1..n} θj^2 ]
The λ in the last term is called the regularization parameter. It inflates the cost of large θj, so the optimization prefers smaller parameter values that still fit the data.
The higher λ is, the less the model overfits (too high a λ causes underfitting instead), so λ must be chosen carefully.
Regularized Linear regression
Gradient Descent
We modify the gradient descent update to treat θ0 separately from the rest of the parameters, because we do not want to penalize θ0:
θ0 := θ0 − (α/m) Σ ( hθ(x(i)) − y(i) )·x0(i)
θj := θj·(1 − α·λ/m) − (α/m) Σ ( hθ(x(i)) − y(i) )·xj(i)   (j = 1..n)
Note: the factor (1 − α·λ/m) is always a little less than 1, so each iteration shrinks θj slightly before the usual gradient step.
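A vectorized Octave sketch of this regularized update; theta(1) corresponds to θ0 under Octave's 1-based indexing and is left unpenalized, and X, y, theta, alpha, lambda, m are assumed to be in scope.

grad = (1 / m) * X' * (X * theta - y);                    % unregularized gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % penalize every theta except theta_0
theta = theta - alpha * grad;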
Normal equation
θ = (XᵀX + λ·L)⁻¹Xᵀy, where L is the (n+1)×(n+1) identity matrix with its top-left entry (the θ0 position) set to 0.
For λ > 0 it can be proven that the matrix XᵀX + λ·L is invertible.
Regularized Logistic Regression
Gradient Descent
Similar to regularized linear regression, but with the logistic hypothesis hθ(x) = 1 / (1 + e^(−θᵀx)).
Neural Network
Terminology
a_i^(j) = activation of unit i in layer j
Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j+1
h can be called an activation function.
If the network has s_j units in layer j and s_{j+1} units in layer j+1, then Θ^(j) has dimension s_{j+1} × (s_j + 1).
The +1 comes from the bias node.
g is the sigmoid function.
Forward propagation
In short, a_i^(j) = g(z_i^(j)).
Θ_ij is the weight that multiplies a_j in the current layer on its way to unit i in the next layer.
Example: let a^(1) = x ∈ R^(n+1), with the bias entry a_0^(1) = 1.
a^(2) is computed by z^(2) = Θ^(1)·a^(1) and a^(2) = g(z^(2)).
(In the lecture example, this Θ is chosen so that the network computes the logical OR function.)
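A minimal Octave sketch of forward propagation for a 3-layer network; Theta1, Theta2 and the layer sizes are illustrative, and sigmoid is the function defined earlier.

a1 = [1; x];             % input plus bias unit a0 = 1, x is an n x 1 column vector
z2 = Theta1 * a1;        % Theta1: s2 x (n+1)
a2 = [1; sigmoid(z2)];   % hidden layer activations plus bias unit
z3 = Theta2 * a2;        % Theta2: K x (s2+1)
a3 = sigmoid(z3);        % h_Theta(x), the output layer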
Multiclass classification
Use one vs all method.
hΘ(x) is a vector, e.g. [0; 1; 0; 0]
The training set: (x(1),y(1)), (x(2),y(2)),… , (x(m),y(m))
y(i) look like this [1; 0; 0; 0] , [0; 1; 0; 0], [0; 0; 1; 0] ,...
Cost function
L = total no. of layers in network
s_l = no. of units in layer l (not counting the bias unit)
Backpropagation Algorithm
g' is the derivative of the sigmoid function,
which is g'(z) = g(z)·(1 − g(z))
z^(l) is the weighted sum Θ^(l−1)·a^(l−1).
The term "backpropagation" comes from computing the error terms from the back (output layer) toward the front.
The triangle Δ_ij^(l) is an accumulator that sums the error contributions over the training examples.
D^(l) can be called the gradient (of the cost function with respect to Θ^(l)).
For j ≠ 0, D_ij^(l) = (1/m)·(Δ_ij^(l) + λ·Θ_ij^(l)); for j = 0, D_ij^(l) = (1/m)·Δ_ij^(l).
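A tiny Octave sketch of the g'(z) used by backprop; sigmoidGradient is an illustrative name.

% sigmoidGradient.m (illustrative)
function gprime = sigmoidGradient(z)
  g = 1 ./ (1 + exp(-z));
  gprime = g .* (1 - g);   % g'(z) = g(z) * (1 - g(z))
end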
Unrolling parameters
Compute the gradientVec with
function [jVal, gradientVec] = costFunction(thetaVec)
where DVec (the unrolled gradient matrices D(1), D(2), ...) is returned as gradientVec.
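A sketch of unrolling and recovering the parameter matrices in Octave; the 10x11 and 1x11 sizes are made-up examples.

thetaVec = [Theta1(:); Theta2(:)];            % unroll the matrices into one long vector
DVec     = [D1(:); D2(:)];                    % same for the gradient matrices
Theta1 = reshape(thetaVec(1:110),   10, 11);  % recover the 10 x 11 matrix
Theta2 = reshape(thetaVec(111:121),  1, 11);  % recover the 1 x 11 matrix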
Gradient checking
- used to check that the forward propagation / back propagation implementation is correct
- this is only for making sure that DVec is computed correctly
- we do not use gradApprox (the numerical gradient computation) in place of DVec for training because it is much slower
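A sketch of the numerical gradient check in Octave; EPSILON around 1e-4 is typical, and J is assumed to be a function handle that returns the cost for a given (unrolled) theta.

EPSILON = 1e-4;
gradApprox = zeros(size(theta));
for i = 1:numel(theta)
  thetaPlus  = theta;  thetaPlus(i)  += EPSILON;
  thetaMinus = theta;  thetaMinus(i) -= EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
% check that gradApprox is approximately equal to DVec, then turn the check off for training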
Random Initialization
(initial theta)
Description: if all the weights are initialized to the same value, then after every backprop update all the hidden units keep computing the same value, so a1, a2, ... act as the same feature. That is redundant.
Solved by random initialization (symmetry breaking).
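A sketch of symmetry-breaking random initialization in Octave; INIT_EPSILON and the 10x11 size are illustrative.

INIT_EPSILON = 0.12;
% each weight drawn uniformly from [-INIT_EPSILON, INIT_EPSILON]
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;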
Training a Neural Network
To improve a learning algorithm:
- Get more training examples → fixes high variance
- Try smaller sets of features → fixes high variance
- Try getting more features → fixes high bias
- Try adding polynomial features → fixes high bias
- Try decreasing lambda → fixes high bias
- Try increasing lambda → fixes high variance
Evaluate a hypothesis in learning algorithm
1. Split into test set and training set
2. Learn Θ and minimize Jtrain(Θ) using the training set
3. Compute the test set error Jtest(Θ)
New idea: split data set into : Training set, cross validation set, test set.
Cross validation data set
- this is used to select a model
We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters in Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with Jtest(Θ(d)), where d is the polynomial degree with the lowest cross validation error;
Diagnose whether it’s high bias(underfit) or high variance(overfit) by degree of polynomial
Select the best lambda for regularization
In order to choose the model and the regularization term λ, we need to:
1. Create a list of lambdas (i.e. λ∈{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24});
2. Create a set of models with different degrees or any other variants.
3. Iterate through the λs and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with λ) on JCV(Θ) without regularization, i.e. λ = 0.
5. Select the best combo that produces the lowest error on the cross validation set.
6. Using the best combo Θ and λ, apply it on JTest(Θ) to see if it has a good generalization of the problem.
Learning curves
Error Analysis
The recommended approach to solving machine learning problems is to:
• Start with a simple algorithm, implement it quickly, and test it early on your cross validation
data.
• Plot learning curves to decide if more data, more features, etc. are likely to help.
• Manually examine the errors on examples in the cross validation set and try to spot a trend
where most of the errors were made.
Precision/Recall
high precision, high recall = good algorithm
Trade-off precision recall
- It is hard to get high precision and high recall at the same time
- we vary the prediction threshold on hθ(x)
- high threshold → high precision, low recall
- low threshold → low precision, high recall
Compare algorithm
1) Evaluate F1 = (2PR)/(P+R), so 0 <= F1 <= 1; P: precision, R: recall
2) Select the algorithm that gives the highest value of F1
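A sketch of computing precision, recall and F1 in Octave from 0/1 vectors pred and y; the names are illustrative.

truePos   = sum((pred == 1) & (y == 1));
precision = truePos / sum(pred == 1);   % of the predicted positives, how many are real positives
recall    = truePos / sum(y == 1);      % of the real positives, how many did we catch
F1 = (2 * precision * recall) / (precision + recall);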
Support Vector Machine
The SVM cost function is obtained from the logistic regression cost by replacing the log terms with the hinge costs cost1(θᵀx) and cost0(θᵀx) (the last line of the derivation in the lecture).
Hypothesis of SVM: outputs either 1 or 0 directly, not a probability.
SVM can be called a Large Margin Classifier: it chooses the separator that gives the largest margin (the black line in the lecture figure rather than the narrower alternatives).
If C is large enough, a single outlier (the X at the bottom left of the figure) makes the magenta line the result; if that X were absent, the black line would be chosen. A very large C therefore makes the boundary overly sensitive to individual examples.
p(i) is the projection of x(i) onto θ, so θᵀx(i) = p(i)·‖θ‖.
In the left graph, projecting the examples onto θ gives small p(i), which would force ‖θ‖ to be huge; but our goal is to minimize ‖θ‖, so the optimizer prefers the boundary whose projections are large: the green line in the right graph.
Kernels
f_i measures how similar x is to the landmark l(i): f_i = similarity(x, l(i)) = exp(−‖x − l(i)‖² / (2σ²)) for the Gaussian kernel.
Predict y by SVM
Predict 1 when θ0 + θ1·f1 + θ2·f2 + … ≥ 0
How to choose landmark
- use every training example x(i) as a landmark l(i)
- n is the number of features f, so now n = m (one feature per landmark)
When m is very large, like 100,000+, the rewritten (second) form of the optimization objective lets SVM software run more efficiently.
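A sketch of the Gaussian (RBF) kernel similarity in Octave; gaussianKernel is an illustrative name, and x1, x2 are feature column vectors.

% gaussianKernel.m (illustrative)
function sim = gaussianKernel(x1, x2, sigma)
  sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));   % near 1 when x1 is close to x2, near 0 when far
end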
SVM parameter
How to choose kernel
Feature scaling
- do this before running the SVM
Other kernels
- Polynomial kernel: k(x, l) = (xᵀl + constant)^degree
- String kernel: input data type is string
Multi-class classification
- use one vs all method
- Train K SVMs, one per class, obtaining Θ(i) for each class i; predict the class i that gives the highest (Θ(i))ᵀx
What Classifier should be used