

It is time for the third building block of the machine learning algorithm: the objective function. The objective function is the measure used to evaluate how well the model's outputs match the desired correct values. In this lesson, we will elaborate on that. Objective functions are generally split into two types: loss functions and reward functions. Loss functions are also called cost functions. The lower the loss function, the higher the level of accuracy of the model. Most often, we work with loss functions. An intuitive example is a loss function that measures the error of prediction. We want to minimize the error of prediction, and thus minimize the loss.

Reward functions, on the other hand, are basically the opposite of loss functions. The higher the reward function, the higher the level of accuracy of the model. Usually, reward functions are used in reinforcement learning, where the goal is to maximize a specific result. Remember the algorithm we mentioned earlier, the one playing Super Mario? The score obtained by the algorithm while playing the game is the reward function. Maximizing the final score would mean maximizing the reward function. Alright. When dealing with supervised learning, we normally encounter loss functions. Therefore, in this course, we'll deal mostly with them. In our next video, we will explore the two most common loss functions. Thanks for watching.
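The loss/reward relationship described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the course: the function names and the choice of squared error are mine.

```python
# Sketch (not from the course): a loss and a reward for the same
# prediction problem. Lower loss = more accurate; higher reward = more accurate.

def loss(prediction, target):
    # Squared error of the prediction: worse predictions score higher.
    return (prediction - target) ** 2

def reward(prediction, target):
    # A reward moves in the opposite direction of a loss.
    return -loss(prediction, target)

# A prediction closer to the target gives a lower loss and a higher reward.
print(loss(4.0, 5.0), loss(4.9, 5.0))      # the second loss is smaller
print(reward(4.0, 5.0), reward(4.9, 5.0))  # the second reward is larger
```

Minimizing the loss and maximizing the reward are two views of the same goal.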
Earlier, we divided supervised learning into two types: regression and classification. We will take the same approach here and consider two of the most common types of loss functions. Each is used with one of the two types of supervised learning. Note that the objective function is a separate block in our framework from the model. That is to say, what we are going to discuss now is generally true for all models, regardless of their linearity.

Okay. First, we should define another concept, called the target, denoted by t. The target is essentially the desired value at which we are aiming. Generally, we want our output y to be as close as possible to the target t. In the cats and dogs example we've been employing so far, the targets would be the labels we assign to each photo. So we are 100% sure these values are correct; they are the values we aspire to. The y values are the outputs of our model. The machine learning algorithm aims to find a function of x that outputs values as close to the targets as possible. Using this new notation, the loss function evaluates the accuracy of the outputs with respect to the targets.

All right, let's see the two common functions we talked about. First, we will talk about regressions. I'd like to remind you that the outputs of a regression are continuous numbers. A commonly used loss function is the squared loss, also called the L2-norm loss. In the machine learning realm, the method for calculating it matches the least squares method used in statistics. Mathematically, it looks like this: L(y, t) = Σ (yᵢ − tᵢ)², the sum of the squared differences between the output values y and the targets t. Naturally, the lower this sum is, the lower the error of prediction and, therefore, the lower the cost function. Okay, we will check out a common loss function for classification in our next lesson.
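The squared (L2-norm) loss just described can be sketched in a few lines of Python. The function and variable names here are mine, for illustration only.

```python
# Sketch of the L2-norm (squared) loss: L = sum((y_i - t_i)^2),
# the sum of squared differences between outputs y and targets t.

def l2_loss(outputs, targets):
    return sum((y - t) ** 2 for y, t in zip(outputs, targets))

# The closer the outputs track the targets, the lower the loss.
targets = [3.0, -1.0, 2.0]
good_outputs = [2.9, -1.1, 2.2]
bad_outputs = [5.0, 0.0, 0.0]

print(l2_loss(good_outputs, targets))  # approximately 0.06
print(l2_loss(bad_outputs, targets))   # 9.0
```

Minimizing this sum over the model's parameters is exactly the least squares idea mentioned above.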
Hi and welcome back. What about classification? We discussed that the output of a regression is a number, but for classification, things are different. Since the outputs are categories like cats and dogs, we need a better suited strategy. The most common loss function used for classification is cross-entropy, and it is defined as L(y, t) = −Σ t × ln(y), minus the sum of the targets times the natural log of the outputs.

Time for an example, before I lose your interest. Let's consider our cats and dogs problem. This time we will have a third category: horse. Here's an image labeled as a dog. The label is the target. But how does it look in numerical terms? Well, the target vector t for this photo would be [0, 1, 0]: the first zero means it is not a cat, the one shows it is a dog, and the third zero indicates it is not a horse. Okay, let's examine a different image. This time it will be labeled as a horse. Its target vector is [0, 0, 1]. Imagine the outputs of our model for these two images are [0.4, 0.4, 0.2] for the first image and [0.1, 0.2, 0.7] for the second. After some machine learning transformations, these vectors show the probabilities for each photo to be a cat, a dog, or a horse. We will learn how to create these vectors later in the course. For now, we just need to know how to interpret them. The first vector shows that, according to our algorithm, there is a 0.4, or 40%, chance that the first photo is a cat, a 40% chance it is a dog, and a 20% chance it is a horse. So that's the interpretation of these vectors.

What about the cross-entropy of each photo? The cross-entropy loss for the first image is −0 × ln(0.4) − 1 × ln(0.4) − 0 × ln(0.2), which equals approximately 0.92. The cross-entropy loss for the second image is −0 × ln(0.1) − 0 × ln(0.2) − 1 × ln(0.7), which equals approximately 0.36. As we already know, the lower the loss function, or the cross-entropy in this case, the more accurate the model. So what's the meaning of these two cross-entropies? They show the second loss is lower, therefore its prediction is superior. This is what we expected. For the first image, the model was not sure if the photo was of a dog or a cat; there was an equal 40% probability for both options. We can contrast this with the second photo, where the model was 70% sure it was a horse. Thus the cross-entropy was lower.

Okay, an important note is that with classification, our target vectors consist of a bunch of zeros and a one, which indicates the correct category. Therefore, we could simplify the above formulas to minus the log of the probability of the output for the correct answer: L = −ln(y_correct). Here's an illustration of how our initial formulas would change: for the first image, −ln(0.4) ≈ 0.92, and for the second, −ln(0.7) ≈ 0.36.
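The cross-entropy calculation worked through above can be sketched directly in Python; the numbers are the ones from the lesson, while the function and variable names are mine.

```python
import math

# Cross-entropy loss: L(y, t) = -sum(t_i * ln(y_i)).
def cross_entropy(outputs, targets):
    return -sum(t * math.log(y) for y, t in zip(outputs, targets))

# Target vectors: [cat, dog, horse].
dog_target = [0, 1, 0]
horse_target = [0, 0, 1]

# Model outputs (probabilities) for the two example images.
first_outputs = [0.4, 0.4, 0.2]
second_outputs = [0.1, 0.2, 0.7]

print(round(cross_entropy(first_outputs, dog_target), 2))     # 0.92
print(round(cross_entropy(second_outputs, horse_target), 2))  # 0.36

# The simplified form: since the target is one-hot, only the log of the
# probability assigned to the correct class matters.
print(round(-math.log(0.4), 2))  # 0.92, same as above
```

Note that the zeros in the target vector wipe out every term except the one for the correct category, which is exactly the simplification mentioned at the end of the lesson.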
Alright. Those were examples of commonly used loss functions for regression and classification. Most regression and classification problems are solved by using them, but there are other loss functions that can help us resolve a problem. We must emphasize that any function that holds the basic property of being higher for worse results and lower for better results can be a loss function. We will often use this observation when coding. It will all become clear when we see them in action. That's all for now. Thanks for watching. Bye.
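The lesson's remark that any function which is higher for worse results and lower for better ones can serve as a loss can be illustrated with the absolute (L1) loss, an alternative the course does not cover here; this sketch and its names are mine.

```python
# Sketch: the absolute (L1) loss also satisfies the defining property of a
# loss function -- higher for worse predictions, lower for better ones.

def l1_loss(outputs, targets):
    return sum(abs(y - t) for y, t in zip(outputs, targets))

targets = [1.0, 0.0]
print(l1_loss([1.1, 0.2], targets))   # small error -> small loss
print(l1_loss([3.0, -2.0], targets))  # large error -> large loss
```

Any such function can, in principle, be plugged into the objective-function block of the framework.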
We have reached the last piece of the puzzle before we can start building our first machine learning algorithm. So far, we have learned, at least conceptually, how to input data into a model and measure, through the objective function, how close the outputs we obtain are to the targets. However, the actual optimization process happens when the optimization algorithm varies the model's parameters until the loss function has been minimized. In the context of the linear model, this implies varying w and b. Okay, the simplest and most fundamental optimization algorithm is gradient descent. I would like to remind you that the gradient is the multivariate generalization of the derivative concept.

Let's first consider a non-machine-learning example to understand the logic behind gradient descent. Here is a function: f(x) = 5x² + 3x − 4. Our goal is to find the minimum of this function using the gradient descent methodology. The first step is to find the first derivative of the function. In our case, it is f'(x) = 10x + 3. The second step would be to choose an arbitrary number, for example, x₀ = 4 ("x-naught" is the proper way to say x₀). Then we calculate a different number, x₁, following the update rule: xᵢ₊₁ = xᵢ − η × f'(xᵢ). So x₁ is equal to 4 − η × (10 × 4 + 3), or 4 − η × 43. So what is η? This is eta, the learning rate. It is the rate at which the machine learning algorithm forgets old beliefs for new ones. We choose the learning rate for each case. By the end of this lecture, the concept of eta will be clearer. Using the update rule, we can find x₂, x₃, and so on. After conducting the update operation long enough, the values will eventually stop updating. That is the point at which we know we have reached the minimum of the function. This is because the first derivative of the function is zero when we have reached the minimum. So the update rule xᵢ₊₁ = xᵢ − η × f'(xᵢ) will become xᵢ₊₁ = xᵢ − 0, or xᵢ₊₁ = xᵢ. Therefore, the update rule will no longer update.

Let's illustrate this with an example. Let's take an eta of 0.01. We start descending: x₁ is equal to 3.57, x₂ is equal to 3.18, and so on. Around the 85th observation, we see our sequence doesn't change anymore. It has converged to −0.3. Once the minimum is reached, all subsequent values are equal to it, since our update rule has become xᵢ₊₁ = xᵢ − 0. Graphically, gradient descent looks like this: we start from any arbitrary point and descend to the minimum. All right, the speed of minimization depends on eta. Let's try with an eta of 0.1: we have converged to the minimum of −0.3 after the first iteration. Now, knowing the minimum is −0.3, let's see an eta of 0.001. This step is so small that we need approximately 900 iterations before we reach the desired value. We descend to the same extremum, but in a much slower manner. Finally, I'll try with an eta of 0.2. We obtain a sequence of 4 and −4.6 until infinity: no matter how many iterations we execute, our sequence will never reach −0.3. We already know −0.3 is the desired value, but if we didn't, we would be deceived. This situation is called oscillation: we bounce around the minimum value, but we never reach it. We can see 4 or −4.6 in the algorithm, but neither is its true minimum. Graphically, we are stuck between these two points, never reaching the minimum. Now that we have seen different learning rates and their performance, let's state this rule generally: we want the learning rate to be high enough so we can reach the closest minimum after repeating the operation a rational number of times. So perhaps 0.001 was too small for this function. At the same time, we want eta to be low enough so we are sure we reach the minimum and don't oscillate around it, like in the case where we chose an eta of 0.2. In the sections in which we study deep learning, we will discuss a few smarter techniques that allow us to choose the right rate.

All right, there are several key takeaways from this lesson. First, using gradient descent, we can find the minimum value of a function through a trial and error method. That's just how computers think. Second, there is an update rule that allows us to cherry-pick the trials so we can reach the minimum faster. Each consequent trial is better than the previous one with a nice update rule. Third, we must think about the learning rate, which has to be high enough so we don't iterate forever and low enough so we don't oscillate forever. Finally, once we have converged, we should stop updating, or, as we will see in the coding example, we should break the loop. One way to know we have converged is when the difference between the term at place i + 1 and the term at place i is tiny, say 0.001. Once again, that's a topic we'll see in more detail later. Please download and look at the Excel file associated with gradient descent, available in the Course Resources section. We encourage you to play around with the learning rate or the arbitrarily chosen number x₀ and see what happens. This will give you a good intuition about the learning rate, which is central to teaching the algorithm. In the next lesson, we will generalize this concept to the n-parameter gradient descent.
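The gradient descent walkthrough above can be sketched in Python. The function f(x) = 5x² + 3x − 4, its derivative 10x + 3, the starting point x₀ = 4, and the learning rates come from the lesson; the stopping tolerance and iteration cap are my own choices, so the exact iteration counts differ slightly from the lecture's.

```python
# Gradient descent on f(x) = 5x^2 + 3x - 4, whose minimum is at x = -0.3.

def f_prime(x):
    # First derivative of f(x) = 5x^2 + 3x - 4.
    return 10 * x + 3

def gradient_descent(x0, eta, tolerance=1e-6, max_iter=10_000):
    x = x0
    for i in range(max_iter):
        x_next = x - eta * f_prime(x)       # update rule: x_{i+1} = x_i - eta * f'(x_i)
        if abs(x_next - x) < tolerance:     # updates have effectively stopped: converged
            return x_next, i + 1
        x = x_next
    return x, max_iter                      # cap reached: may not have converged

# eta = 0.01 descends steadily to the minimum at x = -0.3.
x_min, steps = gradient_descent(x0=4, eta=0.01)
print(round(x_min, 3), steps)

# eta = 0.2 oscillates between 4 and -4.6 and never reaches -0.3.
x_osc, steps_osc = gradient_descent(x0=4, eta=0.2)
print(round(x_osc, 3), steps_osc)  # still bouncing when the iteration cap is hit
```

Playing with `eta` here (try 0.1, 0.001, and 0.2) reproduces the fast, slow, and oscillating behaviors discussed in the lesson.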