It is time for the third building block of the machine learning algorithm: the objective function. The objective function is the measure used to evaluate how well the model's outputs match the desired correct values. In this lesson, we will elaborate on that.

Objective functions are generally split into two types: loss functions and reward functions. Loss functions are also called cost functions. The lower the loss function, the higher the level of accuracy of the model. Most often, we work with loss functions. An intuitive example is a loss function that measures the error of prediction. We want to minimize the error of prediction, thus minimize the loss.

Reward functions, on the other hand, are basically the opposite of loss functions. The higher the reward function, the higher the level of accuracy of the model. Usually, reward functions are used in reinforcement learning, where the goal is to maximize a specific result. Remember the algorithm we mentioned earlier, the one playing Super Mario? The score obtained by the algorithm while playing the game is the reward function. Maximizing the final score would mean maximizing the reward function.

Alright. When dealing with supervised learning, we normally encounter loss functions. Therefore, in this course, we'll deal mostly with them. In our next video, we will explore the two most common loss functions. Thanks for watching.

Earlier, we divided supervised learning into two types: regression and classification. We will take the same approach here and consider two of the most common types of loss functions. Each is used with one of the two types of supervised learning. Note that the objective function is a separate block in our framework from the model. That is to say, what we are going to discuss now is generally true for all models, regardless of their linearity.
Okay. First, we should define another concept called the target, denoted by t. The target is essentially the desired value at which we are aiming. Generally, we want our output y to be as close as possible to the target t. In the cats-and-dogs example we've been employing so far, the targets would be the labels we assign to each photo. So we are 100% sure these values are correct. They are the values we aspire to.

The y values are the outputs of our model. The machine learning algorithm aims to find a function of x that outputs values as close to the targets as possible. Using this new notation, the loss function evaluates the accuracy of the outputs with respect to the targets.

All right, let's see the two common functions we talked about. First, we will talk about regressions. I'd like to remind you that the outputs of a regression are continuous numbers. A commonly used loss function is the squared loss, also called L2-norm loss. In the machine learning realm, the method for calculating it equals the least squares method used in statistics. Mathematically, it looks like this: the sum of the squared differences between the output values y and the targets t, or L = Σᵢ (yᵢ − tᵢ)². Naturally, the lower this sum is, the lower the error of prediction, and therefore the lower the cost function.

Okay, we will check out a common loss function for classification in our next lesson.
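The squared (L2-norm) loss described in this lesson can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation; the function name `l2_loss` and the toy numbers are my own, while y and t follow the lesson's notation.

```python
def l2_loss(y, t):
    """Squared (L2-norm) loss: the sum of squared differences
    between the outputs y and the targets t."""
    return sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, t))

# Toy regression example: the closer the outputs are to the targets,
# the lower the loss.
targets = [3.0, -1.0, 2.0]
good_outputs = [2.9, -1.1, 2.1]   # close to the targets
bad_outputs = [1.0, 0.0, 4.0]     # far from the targets

print(l2_loss(good_outputs, targets))  # small loss
print(l2_loss(bad_outputs, targets))   # much larger loss
```

Note that perfect predictions give a loss of exactly zero, which is the minimum this function can reach.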
Hi, and welcome back. What about classification? We discussed that the output of a regression is a number. But for classification, things are different. Since the outputs are categories like cats and dogs, we need a better-suited strategy.

The most common loss function used for classification is cross-entropy, and it is defined as L(y, t) = −Σᵢ tᵢ ln(yᵢ), that is, minus the sum of the targets times the natural log of the outputs.

Time for an example, before I lose your interest. Let's consider our cats-and-dogs problem. This time we will have a third category: horse. Here's an image labeled as a dog. The label is the target. But how does it look in numerical terms? Well, the target vector t for this photo would be [0, 1, 0]: the first zero means it is not a cat, the one shows it is a dog, and the third zero indicates it is not a horse. Okay, let's examine a different image. This time it will be labeled as a horse. Its target vector is [0, 0, 1].

Imagine the outputs of our model for these two images are [0.4, 0.4, 0.2] for the first image and [0.1, 0.2, 0.7] for the second. After some machine learning transformations, these vectors show the probabilities for each photo to be a cat, a dog, or a horse. We will learn how to create these vectors later in the course. For now, we just need to know how to interpret them. The first vector shows that, according to our algorithm, there is a 0.4, or 40%, chance that the first photo is a cat, a 40% chance it is a dog, and a 20% chance it is a horse. So that's the interpretation of these vectors.

What about the cross-entropy of each photo? The cross-entropy loss for the first image is −0 × ln(0.4) − 1 × ln(0.4) − 0 × ln(0.2), which equals approximately 0.92. The cross-entropy loss for the second image is −0 × ln(0.1) − 0 × ln(0.2) − 1 × ln(0.7), which equals approximately 0.36. As we already know, the lower the loss function, or the cross-entropy in this case, the more accurate the model.

So what's the meaning of these two cross-entropies? They show the second loss is lower; therefore, its prediction is superior. This is what we expected. For the first image, the model was not sure if the photo was of a dog or a cat; there was an equal 40% probability for both options. We can oppose this to the second photo, where the model was 70% sure it was a horse. Thus, the cross-entropy was lower.

Okay, an important note is that with classification, our target vectors consist of a bunch of zeros and a single one, which indicates the correct category. Therefore, we could simplify the above formulas to: minus the natural log of the output probability for the correct answer. Here's an illustration of how our initial formulas would change.

Alright. Those were examples of commonly used loss functions for regression and classification. Most regression and classification problems are solved by using them. But there are other loss functions that can help us resolve a problem. We must emphasize that any function that holds the basic property of being higher for worse results and lower for better results can be a loss function. We will often use this observation when coding. It will all become clear when we see them in action.

That's all for now. Thanks for watching. Bye.
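The cross-entropy calculations from the cats, dogs, and horse example can be sketched in Python. The function names are my own; the numbers are the ones used in this lesson, and the second function shows the simplified one-hot form mentioned above.

```python
import math

def cross_entropy(outputs, targets):
    """Full cross-entropy: L(y, t) = -sum(t_i * ln(y_i))."""
    return -sum(t * math.log(y) for y, t in zip(outputs, targets))

def simplified_cross_entropy(outputs, correct_index):
    """With one-hot targets (all zeros and a single one), the loss
    reduces to -ln of the probability assigned to the correct class."""
    return -math.log(outputs[correct_index])

# Image 1: labeled "dog"   -> target [0, 1, 0], model outputs [0.4, 0.4, 0.2]
# Image 2: labeled "horse" -> target [0, 0, 1], model outputs [0.1, 0.2, 0.7]
print(round(cross_entropy([0.4, 0.4, 0.2], [0, 1, 0]), 2))  # 0.92
print(round(cross_entropy([0.1, 0.2, 0.7], [0, 0, 1]), 2))  # 0.36

# The simplified form gives the same result for one-hot targets:
print(round(simplified_cross_entropy([0.4, 0.4, 0.2], 1), 2))  # 0.92
```

The second image's lower loss (0.36 vs. 0.92) reflects the model's greater confidence in the correct class, exactly as discussed above.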
We have reached the last piece of the puzzle before we can start building our first machine learning algorithm. So far, we have learned, at least conceptually, how to input data into a model and how to measure, through the objective function, how close the outputs we obtain are to the targets. However, the actual optimization process happens when the optimization algorithm varies the model's parameters until the loss function has been minimized. In the context of the linear model, this implies varying w and b.

Okay, the simplest and most fundamental optimization algorithm is the gradient descent. I would like to remind you that the gradient is the multivariate generalization of the derivative concept. Let's first consider a non-machine-learning example to understand the logic behind the gradient descent.

Here is a function: f(x) = 5x² + 3x − 4. Our goal is to find the minimum of this function using the gradient descent methodology. The first step is to find the first derivative of the function. In our case, it is f'(x) = 10x + 3. The second step would be to choose any arbitrary number, for example, x₀ = 4 ("x naught" is the proper way to say x₀). Then we calculate a different number, x₁, following the update rule: xᵢ₊₁ = xᵢ − η · f'(xᵢ), where f'(xᵢ) is the first derivative of the function at xᵢ. So x₁ is equal to 4 − η × (10 × 4 + 3), or 4 − η × 43.

So what is η (eta)? This is the learning rate. It is the rate at which the machine learning algorithm forgets old beliefs for new ones. We choose the learning rate for each case. By the end of this lecture, the concept of eta will be clearer.

Using the update rule, we can find x₂, x₃, and so on. After conducting the update operation long enough, the values will eventually stop updating. That is the point at which we know we have reached the minimum of the function. This is because the first derivative of the function is zero when we have reached the minimum. So the
update rule xᵢ₊₁ = xᵢ − η · f'(xᵢ) will become xᵢ₊₁ = xᵢ − 0, or xᵢ₊₁ = xᵢ. Therefore, the update rule will no longer update.

Let's illustrate this with an example. Let's take an eta of 0.01. We start descending: x₁ is equal to 3.57, x₂ is equal to 3.18, and so on. Around the 85th observation, we see our sequence doesn't change anymore. It has converged to −0.3. Once the minimum is reached, all subsequent values are equal to it, since our update rule has become xᵢ₊₁ = xᵢ − 0. Graphically, the gradient descent looks like this: we start from any arbitrary point and descend to the minimum.

All right, the speed of minimization depends on eta. Let's try with an eta of 0.1. We have converged to the minimum of −0.3 after the first iteration. Now, knowing the minimum is −0.3, let's see an eta of 0.001. This step is so small that we need approximately 900 iterations before we reach the desired value; we descend to the same extremum, but in a much slower manner.

Finally, I'll try with an eta of 0.2. We obtain a sequence of 4 and −4.6 until infinity; no matter how many iterations we execute, our sequence will never reach −0.3. We already know −0.3 is the desired value, but if we didn't, we would be deceived. This situation is called oscillation: we bounce around the minimum value, but we never reach it. We could use 4 or −4.6 in the algorithm, but neither is its true minimum. Graphically, we are stuck at these two points, never reaching the minimum.

Now that we have seen different learning rates and their performance, let's state this rule generally: we want the learning rate to be high enough so we can reach the closest minimum after repeating the operation a rational amount of times. So perhaps 0.001 was too small for this function. At the same time, we want eta to be low enough so we are sure we reach the minimum and don't oscillate around
it, like in the case where we chose an eta of 0.2. In the sections in which we study deep learning, we will discuss a few smarter techniques that allow us to choose the right rate.

All right, there are several key takeaways from this lesson. First, using gradient descent, we can find the minimum value of a function through a trial-and-error method. That's just how computers think. Second, there is an update rule that allows us to cherry-pick the trials so we can reach the minimum faster; each subsequent trial is better than the previous one, given a nice update rule. Third, we must think about the learning rate, which has to be high enough so we don't iterate forever, and low enough so we don't oscillate forever. Finally, once we have converged, we should stop updating, or, as we will see in the coding example, break the loop. One way to know we have converged is when the difference between the term at place i + 1 and the term at place i falls below a tiny threshold, such as 0.0001. Once again, that's a topic we'll see in more detail later.

Please download and look at the Excel file associated with the gradient descent, available in the Course Resources section. We encourage you to play around with the learning rate or the arbitrarily chosen number x₀ and see what happens. This will give you a good intuition about the learning rate, which is central to teaching the algorithm.

In the next lesson, we will generalize this concept to the n-parameter gradient descent.
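The gradient descent walkthrough from this lesson can be sketched in Python. This is a minimal illustration, assuming the example function f(x) = 5x² + 3x − 4, the starting point x₀ = 4, and a stopping rule based on the difference between consecutive terms; the function names and the max_iter safeguard are my own.

```python
def f_prime(x):
    """First derivative of f(x) = 5x^2 + 3x - 4."""
    return 10 * x + 3

def gradient_descent(x0, eta, max_iter=10000, tol=0.0001):
    """Repeat the update rule x_{i+1} = x_i - eta * f'(x_i)
    until consecutive terms differ by less than tol."""
    x = x0
    for i in range(max_iter):
        x_next = x - eta * f_prime(x)
        if abs(x_next - x) < tol:   # converged: break the loop
            return x_next, i + 1
        x = x_next
    return x, max_iter              # never converged (e.g. oscillation)

# eta = 0.01: descends to the minimum at x = -0.3 in under a hundred steps
x_min, steps = gradient_descent(4, 0.01)
print(round(x_min, 2), steps)

# eta = 0.2: bounces between 4 and -4.6 forever, exhausting max_iter
x_osc, steps_osc = gradient_descent(4, 0.2)
print(round(x_osc, 1), steps_osc)
```

Try other values of eta (0.1, 0.001) to reproduce the fast and slow convergence cases discussed above; the same intuition applies as in the Excel file.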