ML - Toppers Solutions 2019

MACHINE LEARNING
(BE - COMPUTER)

Handcrafted by BackkBenchers Publications
Machine Learning

Marks Distribution:
[Chapter-wise marks table across the MAY-16, DEC-16, MAY-17, DEC-17 and MAY-18 exam papers]
CHAP - 1: INTRODUCTION TO MACHINE LEARNING

Q1. What is Machine Learning? Explain how supervised learning is different from unsupervised learning.
Q2. Define Machine Learning. Briefly explain the types of learning.
Ans: [5M | May17 & May18]
MACHINE LEARNING:
1. Machine learning is an application of Artificial Intelligence (AI).
2. It provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
3. Machine learning teaches a computer to do what comes naturally to humans: learn from experience.
4. The primary goal of machine learning is to allow systems to learn automatically from experience, without human intervention or assistance, and adjust actions accordingly.
5. Real life examples: Google Search Engine, Amazon.
6. Machine learning is helpful in:
   a. Improving business decisions.
   b. Increasing productivity.
   c. Detecting disease.
   d. Forecasting weather.
TYPES OF LEARNING:

I) Supervised Learning:
1. Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
2. Basically, supervised learning is a learning in which we teach or train the machine using data which is well labelled.
3. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data and produces a correct outcome from the labelled data.
4. Supervised learning is classified into two categories of algorithms:
   a. Classification.
   b. Regression.

II) Unsupervised Learning:
1. Unlike supervised learning, no teacher is provided, which means no training will be given to the machine.
2. Unsupervised learning is the training of a machine using information that is neither classified nor labelled.
3. It allows the algorithm to act on that information without guidance.
4. Unsupervised learning is classified into two categories of algorithms:
   a. Clustering.
   b. Association.
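The contrast between the two settings can be made concrete with a small sketch (illustrative only; it assumes scikit-learn is available, and the dataset and model choices are not part of the original answer):

```python
# Supervised classification (labels given) vs. unsupervised clustering (labels withheld).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model learns a mapping from features X to known labels y.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: no labels are used; the algorithm only groups similar rows.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments of first 10 samples:", clusters[:10])
```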
DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING:

Supervised learning                        | Unsupervised learning
Uses known and labelled data.              | Uses unknown and unlabelled data.
Less complex to develop.                   | Very complex to develop.
Uses off-line analysis.                    | Uses real-time analysis.
Number of classes are known.               | Number of classes are unknown.
Gives accurate and reliable results.       | Gives moderately accurate and reliable results.

Q3. Applications of Machine Learning.
Q4. Machine learning applications and algorithms.
Ans: [10M | May16, Dec16 & May17]
APPLICATIONS:
I) Virtual Personal Assistants:
1. Siri, Alexa and Google Now are some of the popular examples of virtual personal assistants.
2. As the name suggests, they assist in finding information when asked over voice.
3. All you need to do is activate them and ask "What is my schedule for today?", "What are the flights from Germany to London?", or similar questions.

II) Image Recognition:
1. It is one of the most common machine learning applications.
2. There are many situations where you can classify the object as a digital image.
3. For digital images, the measurements describe the outputs of each pixel in the image.
4. In the case of a black and white image, the intensity of each pixel serves as one measurement.

III) Speech Recognition:
1. Speech recognition (SR) is the translation of spoken words into text.
2. In speech recognition, a software application recognizes spoken words.
3. The measurements in this machine learning application might be a set of numbers that represent the speech signal.
4. We can segment the signal into portions that contain distinct words or phonemes.
5. In each segment, we can represent the speech signal by the intensities or energy in different time-frequency bands.

IV) Medical Diagnosis:
1. ML provides methods, techniques and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
2. It is being used for the analysis of the importance of clinical parameters and of their combinations for prognosis.
3. E.g. prediction of disease progression, and extraction of medical knowledge for outcomes research.

V) Search Engine Result Refining:
1. Google and other search engines use machine learning to improve the search results for you.
2. Every time you execute a search, the algorithms at the backend keep a watch on how you respond to the results.
VI) Statistical Arbitrage:
1. In finance, statistical arbitrage refers to automated short-term trading strategies that involve a large number of securities.
2. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of certain quantities.
3. These measurements can be cast as a classification or estimation problem.

VII) Learning Associations:
1. Learning association is the process of developing insights into various associations between products.
2. A good example is how seemingly unrelated products may reveal an association to one another when analyzed in relation to the buying behaviour of customers.

VIII) Classification:
1. Classification is a process of placing each individual from the population under study into one of many classes.
2. These classes are identified using independent variables.
3. Classification helps analysts use measurements of an object to identify the category to which that object belongs.
4. To establish an efficient rule, analysts use data.

IX) Prediction:
1. Consider the example of a bank computing the probability of any of its loan applicants defaulting on the loan repayment.
2. To compute the probability of the default, the system will first need to classify the available data into certain groups.
3. It is described by a set of rules prescribed by the analysts.
4. Once the classification is done, as per need we can compute the probability.

X) Extraction:
1. Information Extraction (IE) is another application of machine learning.
2. It is the process of extracting structured information from unstructured data, for example web pages, articles, blogs, business reports and e-mails.
3. The process of extraction takes a set of documents as input and produces structured data.

Q5. Explain the steps required for selecting the right machine learning algorithm.
Ans: [10M | May16 & Dec17]
STEPS:

I) Understand Your Data:
1. The type and kind of data we have plays a key role in deciding which algorithm to use.
2. Some algorithms can work with smaller sample sets while others require tons and tons of samples.
3. Certain algorithms work with certain types of data.
4. E.g. Naive Bayes works well with categorical input but is not at all sensitive to missing data.
5. It includes:
   a. Know your data:
      - Look at summary statistics and visualizations.
      - Percentiles can help identify the range for most of the data.
      - Averages and medians can describe central tendency.
   b. Visualize the data:
      - Box plots can identify outliers.
      - Density plots and histograms show the spread of data.
      - Scatter plots can describe bivariate relationships.
   c. Clean your data:
      - Deal with missing values.
      - Missing data affects some models more than others.
      - Missing data for certain variables can result in poor predictions.
   d. Augment your data:
      - Feature engineering is the process of going from raw data to data that is ready for modelling.
      - It can serve multiple purposes.
      - Different models may have different feature engineering requirements; some have built-in feature engineering.
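A minimal sketch of the "know / clean your data" step with pandas is shown below; the CSV path and the thresholds are placeholders, not part of the original notes:

```python
# Illustrative data-understanding and cleaning sketch (assumes pandas is installed).
import pandas as pd

df = pd.read_csv("training_data.csv")   # hypothetical dataset file

print(df.describe())                     # summary statistics: percentiles, mean, spread
print(df.isna().sum())                   # count of missing values per column

# Simple handling of missing values: drop very sparse columns, impute the rest.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
df = df.fillna(df.median(numeric_only=True))
```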
II) Categorize the Problem:
This is a two-step process.
1. Categorize by input:
   a. If you have labelled data, it's a supervised learning problem.
   b. If you have unlabelled data and want to find structure, it's an unsupervised learning problem.
   c. If you want to optimize an objective function by interacting with an environment, it's a reinforcement learning problem.
2. Categorize by output:
   a. If the output of your model is a number, it's a regression problem.
   b. If the output of your model is a class, it's a classification problem.
   c. If the output of your model is a set of input groups, it's a clustering problem.
   d. Do you want to detect an anomaly? That's anomaly detection.
III) Understand Your Constraints:
1. What is your data storage capacity?
   a. Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster.
2. Does the prediction have to be fast, as in real-time applications?
   a. For example, in autonomous driving, it's important that the classification of road signs be as fast as possible to avoid accidents.
3. Does the learning have to be fast?
   a. In some circumstances, training models quickly is necessary: sometimes you need to rapidly update your model, on the fly, with a different dataset.
IV) Find the Available Algorithms:
1. Some of the factors affecting the choice of a model are:
   a. Whether the model meets the business goals.
   b. How much pre-processing the model needs.
   c. How accurate the model is.
   d. How explainable the model is.
   e. How fast the model is.
   f. How scalable the model is.

V) Try each algorithm, assess and compare.
VI) Adjust and combine, using optimization techniques.
VII) Choose, operate and continuously measure.
VIII) Repeat.
Q6. What are the steps in designing a machine learning problem? Explain with the checkers problem.
Q7. Explain the steps in developing a machine learning application.
Q8. Explain the procedure to design a machine learning system.
Ans: [10M | May17, May18 & Dec18]
STEPS FOR DEVELOPING ML APPLICATIONS:

I) Gathering data:
1. This step is very important because the quality of data that you gather will directly determine how good your predictive model will be.
2. We have to collect data from different sources for our ML application training purpose.
3. This includes collecting samples by scraping a website and extracting data from an RSS feed or an API.

II) Preparing the data:
1. Data preparation is where we load our data into a suitable place and prepare it for use in our system.
2. The benefit of making sure data has a standard format is that you can mix and match algorithms and datasets.

III) Choosing a model:
1. There are many models that data scientists and researchers have created over the years.
2. Some of them are well suited for image data, and others for sequence data.
3. It involves recognizing patterns, identifying outliers and detection of novelty.
IV) Training:
1. In this step we will use our data to incrementally improve our model's ability to predict the data that we have inserted.
2. Depending on the algorithm, we feed it good clean data from the previous steps to extract knowledge or information.
3. The knowledge extracted is stored in a format that is readily usable by a machine for the next steps.

V) Evaluation:
1. Once the training is complete, it's time to check if the model is good, using evaluation; this is where testing datasets come into play.
2. Evaluation allows us to test our model against data that has never been used for training.

VI) Parameter tuning:
1. Once we are done with evaluation, we want to see if we can further improve our training in any way.
2. We can do this by tuning our parameters.

VII) Prediction:
1. It is a step where we get to answer some questions.
2. It is the point where the value of machine learning is realized.
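A minimal end-to-end sketch of these steps (illustrative only; the dataset, model and hyper-parameter grid are assumptions made for the example, not prescribed by the notes):

```python
# Gather/prepare -> choose model -> train -> evaluate -> tune -> predict, with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)                      # gathering / preparing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)                        # choosing a model
model.fit(X_train, y_train)                                      # training
print("test accuracy:", model.score(X_test, y_test))             # evaluation

grid = GridSearchCV(LogisticRegression(max_iter=5000),           # parameter tuning
                    {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X_train, y_train)
print("best C:", grid.best_params_)

print("prediction for first test sample:", grid.predict(X_test[:1]))   # prediction
```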
CHECKERS LEARNING PROBLEM:
1. A computer program that learns to play checkers might improve its performance, as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself.
2. Choosing a training experience:
   a. The type of training experience 'E' available to a system can have a significant impact on the success or failure of the learning system.
   b. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
   c. A second attribute is the degree to which the learner controls the sequence of training examples.
   d. Another attribute is how well the training experience represents the distribution of examples over which the final system performance is measured.
3. Assumptions:
   a. Let us assume that our system will train by playing games against itself.
   b. It is allowed to generate as much training data as time permits.
4. Issues Related to Experience:
   a. What type of knowledge/experience should one learn?
   b. How to represent the experience?
   c. What should be the learning mechanism?
5. Target Function:
   a. ChooseMove: B -> M.
   b. ChooseMove is a function
   c. where the input B is the set of legal board states and
   d. it produces M, which is the set of legal moves: M = ChooseMove(B).
6. Representation of Target Function:
   a. x1: the number of white pieces on the board.
   b. x2: the number of red pieces on the board.
   c. x3: the number of white kings on the board.
   d. x4: the number of red kings on the board.
   e. x5: the number of white pieces threatened by red (i.e. which can be captured on red's next turn).
   f. x6: the number of red pieces threatened by white.

   F'(b) = w0 + w1*x1 + w2*x2 + w3*x3 + w4*x4 + w5*x5 + w6*x6

7. The problem of learning a checkers strategy reduces to the problem of learning values for the coefficients w0 through w6 in the target function representation.
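A tiny sketch of this linear target-function representation is given below; the feature values and weights are made-up numbers used only to show how F'(b) is evaluated:

```python
# Linear board evaluation F'(b) = w0 + w1*x1 + ... + w6*x6 for the checkers problem.
def f_hat(board_features, weights):
    w0, ws = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(ws, board_features))

# x1..x6: white pieces, red pieces, white kings, red kings,
#         white pieces threatened by red, red pieces threatened by white.
features = [12, 12, 0, 0, 1, 2]
weights = [0.5, 1.0, -1.0, 2.0, -2.0, -0.5, 0.5]   # w0..w6, arbitrary illustrative values
print("board value:", f_hat(features, weights))
```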
Q9. What are the key tasks of Machine Learning?
Ans: [5M | May16 & Dec17]

CLASSIFICATION:
1. If we have data, say pictures of animals, we can classify them.
2. This animal is a cat, that animal is a dog, and so on.
3. A computer can do the same task using a Machine Learning algorithm that's designed for the classification task.
4. In the real world, this is used for tasks like voice classification and object detection.
5. This is a supervised learning task; we give training data to teach the algorithm the classes they belong to.

REGRESSION:
1. Sometimes you want to predict values.
2. What are the sales next month? And what is the salary for a job?
3. Those types of problems are regression problems.
4. The aim is to predict the value of a continuous response variable.
5. This is also a supervised learning task.

CLUSTERING:
1. Clustering is to create groups of data called clusters.
2. Observations are assigned to a group based on the algorithm.
3. This is an unsupervised learning task; clustering happens fully automatically.
4. Imagine having a bunch of documents on your computer; the computer will organize them into clusters based on their content automatically.
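A small sketch of the document-clustering idea from point 4 is shown below (illustrative only; the example sentences and the number of clusters are assumptions, and scikit-learn is assumed to be available):

```python
# Group short documents by content without any labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["football match score", "netball tournament result",
        "stock market prices", "share market crash"]
X = TfidfVectorizer().fit_transform(docs)                      # text -> numeric features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(docs, labels)))                                 # document -> assigned cluster
```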
FEATURE SELECTION:
1. This task is important because it not only helps to build models of high accuracy,
2. but also helps in achieving the objectives behind building the model.
3. It also helps in reducing the number of model features.
4. Techniques used for feature selection include the filter and wrapper methods.

TESTING AND MATCHING:
1. This task relates to comparing data sets.
2. Pattern matching, where data sets are compared against one another, is a typical example.
Q10. What are the issues in Machine Learning?
Ans: [5M | Dec16 & May18]

ISSUES:
1. What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data?
2. How much training data is sufficient?
3. When and how can prior knowledge held by the learner guide the process of generalizing from examples? Can prior knowledge be helpful even when it is only approximately correct?
4. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
5. What is the best way to reduce the learning task to one or more function approximation problems?
6. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?
Q11. Define a well-posed learning problem. Hence, define the robot driving learning problem.
Ans: [5M | Dec17]

WELL-POSED LEARNING PROBLEMS:
1. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
2. It identifies the following three features:
   a. Class of tasks.
   b. Measure of performance to be improved.
   c. Source of experience.
3. Examples:
   a. Learning to classify chemical compounds.
   b. Learning to drive an autonomous vehicle.
   c. Learning to play bridge.
   d. Learning to parse natural language sentences.

ROBOT DRIVING PROBLEM:
1. Task (T): Driving on a public, 4-lane highway using vision sensors.
2. Performance measure (P): Average distance travelled before an error (as judged by a human overseer).
3. Training experience (E): A sequence of images and steering commands recorded while observing a human driver.
CHAP - 2: LEARNING WITH REGRESSION

Q1. Logistic Regression.
Ans: [10M | May16, May18 & Dec18]
LOGISTIC REGRESSION:
1. Logistic Regression is one of the basic and popular algorithms used to solve a classification problem.
2. It is the go-to method for binary classification problems (problems with two class values).
3. Linear regression algorithms are used to predict/forecast values, but logistic regression is used for classification tasks.
4. The term "Logistic" is taken from the Logit function that is used in this method of classification.
5. The logistic function, also called the sigmoid function, describes properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment.
6. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
7. Figure 2.1 shows an example of the logistic function.

[Figure 2.1: Logistic Function - an S-shaped curve from 0 to 1]
SIGMOID FUNCTION (LOGISTIC FUNCTION):
1. The logistic regression algorithm also uses a linear equation with independent predictors to predict a value.
2. The predicted value can be anywhere between negative infinity and positive infinity.
3. We need the output of the algorithm to be a class variable, i.e. 0-no, 1-yes.
4. Therefore, we squash the output of the linear equation into a range of [0, 1].
5. To squash the predicted value between 0 and 1, we use the sigmoid function.

Linear Equation:
z = theta_0 + theta_1*x1 + theta_2*x2 + ...

Sigmoid Function:
Squashed output h = 1 / (1 + e^(-z))
COST FUNCTION:
1. Since we are trying to predict class values, we cannot use the same cost function used in the linear regression algorithm.
2. Therefore, we use a logarithmic loss function to calculate the cost for misclassifying:

   Cost(h_theta(x), y) = -log(h_theta(x))          if y = 1
   Cost(h_theta(x), y) = -log(1 - h_theta(x))      if y = 0

CALCULATING GRADIENTS:
1. We take partial derivatives of the cost function with respect to each parameter (theta_0, theta_1, ...) to obtain the gradients.
2. With the help of these gradients, we can update the values of theta_0, theta_1, etc.
3. It does assume a linear relationship between the logit of the response and the explanatory variables.
4. Independent variables can even be power terms or some other nonlinear transformations of the original independent variables.
5. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal); binary logistic regression assumes a binomial distribution of the response.
6. The homogeneity of variance does NOT need to be satisfied.
7. Errors need to be independent but NOT normally distributed.
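A small NumPy sketch of the sigmoid, the logarithmic loss and a gradient-descent update is given below; the toy data, learning rate and iteration count are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])  # first column = bias term
y = np.array([0, 0, 1, 1])
theta = np.zeros(2)

for _ in range(1000):
    h = sigmoid(X @ theta)                                        # h_theta(x)
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))      # logarithmic loss
    grad = X.T @ (h - y) / len(y)                                 # partial derivatives
    theta -= 0.5 * grad                                           # gradient descent step

print("learned theta:", theta, "final cost:", cost)
```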
Q2. Explain in brief the Linear Regression technique.
Q3. Explain the concepts behind Linear Regression.
Ans: [5M | May16 & Dec17]

LINEAR REGRESSION:
1. Linear Regression is a machine learning algorithm based on supervised learning.
2. It is a simple machine learning model for regression problems, i.e. when the target variable is a real value.
3. It is used to predict a quantitative response y from the predictor variable x.
4. It is made with the assumption that there is a linear relationship between x and y.
5. This method is mostly used for forecasting and finding out the cause and effect relationship between variables.
6. Figure 2.2 shows an example of linear regression.

[Figure 2.2: Linear Regression - data points with a fitted straight line]

7. The red line in the above graph is referred to as the best fit straight line.
8. Based on the given data points, we try to plot a line that models the points the best.
9. For example, in a simple regression problem (a single x and a single y), the form of the model would be:
   y = a0 + a1*x .......... (Linear Equation)
10. The motive of the linear regression algorithm is to find the best values for a0 and a1.
COST FUNCTIONS:
1. The cost function helps us to figure out the best possible values for a0 and a1 which would provide the best fit line for the data points.
2. We convert this search problem into a minimization problem, where we would like to minimize the error between the predicted value and the actual value.
3. The minimization and cost function are given below:

   minimize (1/n) * sum_i (pred_i - y_i)^2

   J = (1/n) * sum_i (pred_i - y_i)^2

4. The cost function J of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value and the true y value.

GRADIENT DESCENT:
1. To update a0 and a1 values in order to reduce the cost function and achieve the best fit line, the model uses Gradient Descent.
2. The idea is to start with random a0 and a1 values and then iteratively update the values, reaching minimum cost.
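A gradient-descent sketch for the simple model y = a0 + a1*x follows; the toy data, learning rate and iteration count are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a0, a1, lr = 0.0, 0.0, 0.02
for _ in range(5000):
    pred = a0 + a1 * x
    error = pred - y
    cost = np.mean(error ** 2)           # cost J to be minimized
    a0 -= lr * 2 * np.mean(error)        # partial derivative w.r.t. a0
    a1 -= lr * 2 * np.mean(error * x)    # partial derivative w.r.t. a1

print("a0 =", a0, "a1 =", a1, "cost =", cost)
```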
Q4. Explain Regression line, Scatter plot, Error in Prediction and Best fitting line.
Ans: [5M | Dec16]

REGRESSION LINE:
1. The Regression Line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest.
2. There are as many regression lines as variables.
3. Suppose we take two variables, say X and Y; then there will be two regression lines:
   a. Regression line of Y on X: This gives the most probable values of Y from the given values of X.
      Ye = a + bX
   b. Regression line of X on Y: This gives the most probable values of X from the given values of Y.
      Xe = a + bY
SCATTER DIAGRAMS:
1. If data is given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane.
2. The scatter plot is used to visually identify relationships between the first and the second entries of paired data.
3. Example:

[Scatter plot: age vs. size of a plant]

4. The scatter plot above represents the age vs. size of a plant.
5. It is clear from the scatter plot that as the plant ages, its size tends to increase.
ERROR IN PREDICTION:
1. The standard error of the estimate is a measure of the accuracy of predictions.
2. The standard error of the estimate is defined below:

   sigma_est = sqrt( sum (Y - Y')^2 / N )

3. Where sigma_est is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores.
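A quick numeric check of this formula (illustrative values only; in practice Y' comes from the fitted regression line):

```python
import numpy as np

Y      = np.array([2.0, 4.0, 5.0, 4.0, 5.0])   # actual scores
Y_pred = np.array([2.8, 3.4, 4.0, 4.6, 5.2])   # predicted scores
sigma_est = np.sqrt(np.sum((Y - Y_pred) ** 2) / len(Y))
print("standard error of the estimate:", sigma_est)
```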
BEST FITTED LINE:
1. A line of best fit is a straight line that best represents the data on a scatter plot.
2. This line may pass through some of the points, none of the points, or all of the points.
3. Figure 2.3 shows an example of a best fitted line.

[Figure 2.3: Best Fitted Line Example - scatter of points with a fitted straight line]

4. The red line in the above graph is referred to as the best fit straight line.
Q5. The following table shows the midterm and final exam grades obtained for students in a database course. Use the method of least squares using regression to predict the final exam grade of a student who received 86 on the midterm exam.

Midterm exam (x) | Final exam (y)
72               | 84
50               | 63
81               | 77
74               | 78
94               | 90
86               | 75
59               | 49
83               | 79
65               | 77
33               | 52
88               | 74
81               | 90

Ans: [10M | May17]
Finding x*y and x^2 using the given data:

x  | y  | x*y  | x^2
72 | 84 | 6048 | 5184
50 | 63 | 3150 | 2500
81 | 77 | 6237 | 6561
74 | 78 | 5772 | 5476
94 | 90 | 8460 | 8836
86 | 75 | 6450 | 7396
59 | 49 | 2891 | 3481
83 | 79 | 6557 | 6889
65 | 77 | 5005 | 4225
33 | 52 | 1716 | 1089
88 | 74 | 6512 | 7744
81 | 90 | 7290 | 6561

Here n = 12 (total number of values in either x or y).

Now we have to find Σx, Σy, Σ(x*y) and Σx^2.
Where, Σx = sum of all x values
Σy = sum of all y values
Σ(x*y) = sum of all x*y values
Σx^2 = sum of all x^2 values

Σx = 866
Σy = 888
Σ(x*y) = 66088
Σx^2 = 65942

Now we have to find a & b.

a = (n*Σ(x*y) - Σx*Σy) / (n*Σx^2 - (Σx)^2)

Putting values in the above equation,

a = (12*66088 - 866*888) / (12*65942 - (866)^2)
a = 0.59

b = (Σy - a*Σx) / n
b = (888 - (0.59*866)) / 12
b = 31.42

Estimating the final exam grade of a student who received 86 marks: y = a*x + b
Here, a = 0.59
b = 31.42
x = 86

Putting these values in the equation of y,
y = (0.59 * 86) + 31.42
y = 82.16 marks
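An optional cross-check of this least-squares fit with NumPy is sketched below; note that the unrounded slope and intercept differ slightly from the rounded hand calculation above:

```python
import numpy as np

x = np.array([72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81])
y = np.array([84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90])

a, b = np.polyfit(x, y, 1)                    # slope and intercept of y = a*x + b
print("a =", round(a, 2), "b =", round(b, 2))
print("predicted final exam grade for x = 86:", round(a * 86 + b, 2))
```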
Q6. The values of the independent variable x and dependent variable y are given below. Find the least squares regression line y = ax + b. Estimate the value of y when x is 10.

x | 0 | 1 | 2 | 3 | 4
y | 2 | 3 | 5 | 4 | 6

Ans: [10M | May18]
Finding x*y and x^2 using the given data:

x | y | x*y | x^2
0 | 2 | 0   | 0
1 | 3 | 3   | 1
2 | 5 | 10  | 4
3 | 4 | 12  | 9
4 | 6 | 24  | 16

Here n = 5 (total number of values in either x or y).

Now we have to find Σx, Σy, Σ(x*y) and Σx^2.
Where, Σx = sum of all x values
Σy = sum of all y values
Σ(x*y) = sum of all x*y values
Σx^2 = sum of all x^2 values

Σx = 10
Σy = 20
Σ(x*y) = 49
Σx^2 = 30

Now we have to find a & b.

a = (n*Σ(x*y) - Σx*Σy) / (n*Σx^2 - (Σx)^2)

Putting values in the above equation,

a = (5*49 - 10*20) / (5*30 - (10)^2)
a = 0.9

b = (Σy - a*Σx) / n
b = (20 - (0.9*10)) / 5
b = 2.2

Estimating the value of y when x = 10: y = a*x + b
Here, a = 0.9
b = 2.2
x = 10

Putting these values in the equation of y,
y = (0.9 * 10) + 2.2
y = 11.2
Q7. What is linear regression? Find the best fitted line for the following example:

i  | X (height) | Y (weight) | Y (predicted)
1  | 63         | 127        | 120.1
2  | 64         | 121        | 126.3
3  | 66         | 142        | 138.5
4  | 69         | 157        | 157.0
5  | 69         | 162        | 157.0
6  | 71         | 156        | 169.2
7  | 71         | 169        | 169.2
8  | 72         | 165        | 175.4
9  | 73         | 181        | 181.5
10 | 75         | 208        | 193.8

Ans: [10M | Dec18]
LINEAR REGRESSION:
Refer Q2.

SUM:
Finding x*y and x^2 using the given data:

x  | y   | x*y   | x^2
63 | 127 | 8001  | 3969
64 | 121 | 7744  | 4096
66 | 142 | 9372  | 4356
69 | 157 | 10833 | 4761
69 | 162 | 11178 | 4761
71 | 156 | 11076 | 5041
71 | 169 | 11999 | 5041
72 | 165 | 11880 | 5184
73 | 181 | 13213 | 5329
75 | 208 | 15600 | 5625

Here n = 10 (total number of values in either x or y).

Now we have to find Σx, Σy, Σ(x*y) and Σx^2.
Where, Σx = sum of all x values
Σy = sum of all y values
Σ(x*y) = sum of all x*y values
Σx^2 = sum of all x^2 values

Σx = 693
Σy = 1588
Σ(x*y) = 110896
Σx^2 = 48163

Now we have to find a & b.

a = (n*Σ(x*y) - Σx*Σy) / (n*Σx^2 - (Σx)^2)

Putting values in the above equation,

a = (10*110896 - 693*1588) / (10*48163 - (693)^2)
a = 6.14

b = (Σy - a*Σx) / n
b = (1588 - (6.14*693)) / 10
b = -266.71

Finding the best fitting line: y = a*x + b
Here, a = 6.14
b = -266.71
x = unknown

Putting these values in the equation of y,
y = 6.14*x - 266.71

Therefore, the best fitting line is y = 6.14x - 266.71.

Graph for the best fitting line:
[Scatter of weight (y, approx. 120-210) against height (x, approx. 62-75) with the fitted straight line]
CHAP - 3: LEARNING WITH TREES

Q1. What is a decision tree? How will you choose the best attribute for a decision tree classifier? Give a suitable example.
Ans: [10M | Dec18]
DECISION TREE:
1. Decision tree is the most powerful and popular tool for classification and prediction.
2. A decision tree is a flowchart-like tree structure.
3. Each internal node denotes a test on an attribute.
4. Each branch represents an outcome of the test.
5. Each leaf node (terminal node) holds a class label.
6. The best attribute is the attribute that "best" classifies the available training examples.
7. There are two terms one needs to be familiar with in order to define the "best" - entropy and information gain.
8. Entropy is a number that represents how heterogeneous a set of examples is, based on their target class.
9. Information gain, on the other hand, shows how much the entropy of a set of examples will decrease if a specific attribute is chosen.
10. Criteria for selecting the "best" attribute:
   a. We want to get the smallest tree.
   b. Choose the attribute that produces the purest nodes.
EXAMPLE:

[Play-tennis training data: predictors Outlook (Sunny / Overcast / Rainy), Temperature (Hot / Mild / Cool), Humidity (High / Normal) and Windy (True / False), with the target Play (Yes / No), together with the decision tree built from it]
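A short sketch of the entropy and information-gain quantities used above to pick the "best" attribute (illustrative; the tiny dataset in the code is an assumption, not the full play-tennis table):

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder   # decrease in entropy after the split

outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",   "No",    "Yes"]
print("entropy of Play:", entropy(play))
print("information gain of Outlook:", information_gain(outlook, play))
```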
Q2. Explain the procedure to construct a decision tree.
Ans: [5M | Dec18]
DECISION TREE:
1. Decision tree is the most powerful and popular tool for classification and prediction.
2. A decision tree is a flowchart-like tree structure.
3. Each internal node denotes a test on an attribute.
4. Each branch represents an outcome of the test.
5. Each leaf node (terminal node) holds a class label.
6. In decision trees, for predicting a class label for a record we start from the root of the tree.
7. We compare the values of the root attribute with the record's attribute.
8. On the basis of comparison, we follow the branch corresponding to that value and jump to the next node.
9. We continue comparing our record's attribute values with the other internal nodes of the tree.
10. We do this comparison until we reach a leaf node with the predicted class value.
11. This is how the modelled decision tree can be used to predict the target class or value.
PSEUDO CODE FOR DECISION TREE ALGORITHM:
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets.
3. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
4. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.
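In practice this recursive construction is usually done with a library; a minimal sketch with scikit-learn is shown below (the encoded play-tennis-style data and feature names are illustrative assumptions):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Outlook (0=Sunny, 1=Overcast, 2=Rainy), Humidity (0=High, 1=Normal), Windy (0/1)
X = [[0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 1, 0], [2, 1, 1], [1, 1, 1]]
y = ["No", "No", "Yes", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # entropy-based splits
print(export_text(tree, feature_names=["Outlook", "Humidity", "Windy"]))
print(tree.predict([[1, 0, 1]]))   # classify a new record by walking the tree
```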
CONSTRUCTION OF DECISION TREE:

[Figure 3.1: Example decision tree for Tennis Play - root node Outlook (Sunny / Overcast / Rain), internal nodes Humidity (High / Normal) and Wind (Strong / Weak), with Yes / No leaves]

1. Figure 3.1 shows an example of a decision tree for tennis play.
2. A tree can be "learned" by splitting the source set into subsets based on an attribute value test.
3. This process is repeated on each derived subset in a recursive manner called recursive partitioning.
4. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
5. The construction of a decision tree classifier does not require any domain knowledge or parameter setting.
6. Therefore it is appropriate for exploratory knowledge discovery.
7. Decision trees can handle high dimensional data.
8. In general the decision tree classifier has good accuracy.
9. Decision tree induction is a typical inductive approach to learn knowledge on classification.
—_—
Chap - 3] Learning with Trees
Q3.
www.ToppersSolutions.com
What are the issues in decision tree induction?
[IOM | Decl6 & May18]
Ans:
JSSUES.IN DECISION TREE INDUCTION:
1)
Instability:
1.
The reliability of the
information
external
at the onset,
information
in the decision
tree depends
2,
Evena small change
3,
Following things will require reconstructing the tree:
on feeding
the precise
internal
and
in input data can at times, cause large changes in the tree.
a,
Changing variables.
b.
Excluding duplication information.
c.
Altering the sequence
midway.
ll) Analysis Limitations:
1.
Among
the major disadvantages of a decision tree analysis is its inherent limitations.
2.
The major limitations include:
a.
Inadequacy in applying regression and predicting continuous values.
b.
Possibility of spurious relationships.
c,
Unsuitability for estimation of tasks to predict values ofa continuous attribute.
a.
Difficulty in representing functions such as parity or exponential size.
1.
Over fitting happens when
Na
Il) Over fitting:
It reduce training set error at the cost of an increased test set error
3.
learning algorithm continues to develop hypothesis
How to avoid over fittingIt stops growing
the tree very early, before it classifies the training
set.
a.
Pre-pruning:
b.
Post-pruning: It aliows tree to perfectly classify the training set ana then prune the tree.
IV) Attributes with many values:
1
Ifattributes have a lot values, then the Gain could select any value for processing.
2.
This reduces the accuracy for classification.
V) Handling
1.
with costs:
Strong quantitative and analytical knowledge required to build complex decision trees.
This raises the possibility of having to train people to complete a complex decision tree analysis.
3.
The costs involved in such training makes decision tree analysis an expensive option.
Vi) Unwieldy:
1.
Decision trees, while providing easy to view illustrations, can also be unwieldy.
2.
Even
data
that is perfectly divided
into classes and
uses only simple threshold
tests may
require
a
large decision tree.
Large trees are not intelligible, and pose presentation difficulties.
4
Drawing decision trees manually usually require several re-draws owing to space constraints at some
sections
5.
There is no fool proof way to precict the number
of branches or spears that emit frem decisions or
sub-decisions.
Handcrafted by BackkBenchers Publications
Page 21 0f 102_
Scanned by CamScanner
Win TOPPETSS olin
Chap - 3| Learning with Trees
utes:
attribg
atin
uous valued
continor
Vil) Incorp
class prediction
The attributes which have continuous values can't have a proper
1.
Forexample, AGE or Temperature can have any values
There is no solution for it until a range is defined in decision tree itself.
2,
a
3,
VII) Handling examples with missing attributes values:
1.
Itis possible to have missing values in training set.
2.ee
To avoid this, most common
value among
examples can be selected for tuple in consideration
IX) Unable to determine depth of decision tree:
1.
If the training set dees not have an end value i.e. the set is given to be continuous.
2.
This can lead to an infinite decision tree building.
X)
Complexity:
1.
Among
2.
Decision trees are easy to use compared to other decision-making models.
3.
Preparing decision trees, especially large ones with many branches, are complex and time-consumr,
the major decision tree disadvantages are its complexity.
affairs,
4.
Computing
probabilities of different possible
branches,
determining
the best split of each
node.
3
Q4. For the given data, determine the entropy after classification using each attribute for classification separately and find which attribute is best as the decision attribute for the root by finding information gain with respect to entropy of Temperature as the reference attribute.

Sr.No | Temperature | Wind   | Humidity
1     | Hot         | Weak   | High
2     | Hot         | Strong | High
3     | Mild        | Weak   | Normal
4     | Cool        | Strong | High
5     | Cool        | Weak   | Normal
6     | Mild        | Strong | Normal
7     | Mild        | Weak   | High
8     | Hot         | Strong | High
9     | Mild        | Weak   | Normal
10    | Hot         | Strong | Normal

Ans: [10M | May16]
First we have to find the entropy of all attributes.

1. Temperature:
There are three distinct values in Temperature, which are Hot, Mild and Cool.
As there are three distinct values in the reference attribute, the total information gain will be I(p, n, r).
Here, p = total count of Hot = 4
n = total count of Mild = 4
r = total count of Cool = 2
s = p + n + r = 4 + 4 + 2 = 10

Therefore,
I(p, n, r) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) - (r/s)*log2(r/s)
           = -(4/10)*log2(4/10) - (4/10)*log2(4/10) - (2/10)*log2(2/10)
I(p, n, r) = 1.522 .......... using calculator

2. Wind:
There are two distinct values in Wind, which are Strong and Weak.
As there are two distinct values in the reference attribute, the total information gain will be I(p, n).
Here, p = total count of Strong = 5
n = total count of Weak = 5
s = p + n = 5 + 5 = 10

Therefore,
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s)
        = -(5/10)*log2(5/10) - (5/10)*log2(5/10)
I(p, n) = 1 .......... as the values of p and n are the same, the answer will be 1.

3. Humidity:
There are two distinct values in Humidity, which are High and Normal.
As there are two distinct values in the reference attribute, the total information gain will be I(p, n).
Here, p = total count of High = 5
n = total count of Normal = 5
s = p + n = 5 + 5 = 10

Therefore,
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s)
        = -(5/10)*log2(5/10) - (5/10)*log2(5/10)
I(p, n) = 1 .......... as the values of p and n are the same, the answer will be 1.
Now we will find the best root node using Temperature as the reference attribute.
a. Here, the reference attribute is Temperature.
b. There are three distinct values in Temperature, which are Hot, Mild and Cool.
c. Here we will find the total information gain for the whole data using the reference attribute.
d. As there are three distinct values in the reference attribute, the total information gain will be I(p, n, r).

Here, p = total count of Hot = 4
n = total count of Mild = 4
r = total count of Cool = 2
s = p + n + r = 4 + 4 + 2 = 10

Therefore,
I(p, n, r) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) - (r/s)*log2(r/s)
           = -(4/10)*log2(4/10) - (4/10)*log2(4/10) - (2/10)*log2(2/10)
I(p, n, r) = 1.522 .......... using calculator
Now we will find the Information Gain, Entropy and Gain of the other attributes except the reference attribute.

1. Wind:
The Wind attribute has two distinct values, which are Weak and Strong.
We will find the information gain of these distinct values as follows.

i. Weak:
p_i = no. of Hot values related to Weak = 1
n_i = no. of Mild values related to Weak = 3
r_i = no. of Cool values related to Weak = 1
s = p_i + n_i + r_i = 1 + 3 + 1 = 5

Therefore,
I(Weak) = I(p_i, n_i, r_i) = -(1/5)*log2(1/5) - (3/5)*log2(3/5) - (1/5)*log2(1/5)
I(Weak) = 1.371 .......... using calculator

ii. Strong:
p_i = no. of Hot values related to Strong = 3
n_i = no. of Mild values related to Strong = 1
r_i = no. of Cool values related to Strong = 1
s = p_i + n_i + r_i = 3 + 1 + 1 = 5

Therefore,
I(Strong) = I(p_i, n_i, r_i) = -(3/5)*log2(3/5) - (1/5)*log2(1/5) - (1/5)*log2(1/5)
I(Strong) = 1.371 .......... using calculator

Therefore,

Wind
Distinct values | p_i (related Hot) | n_i (related Mild) | r_i (related Cool) | I(p_i, n_i, r_i)
Weak            | 1                 | 3                  | 1                  | 1.371
Strong          | 3                 | 1                  | 1                  | 1.371

Now we will find the Entropy of Wind as follows:

Entropy of Wind = Σ [ (p_i + n_i + r_i) / (p + n + r) ] * I(p_i, n_i, r_i)

Here,
p + n + r = total count of Hot, Mild and Cool from the reference attribute = 10
p_i + n_i + r_i = total count of related values from the above table for each distinct value in the Wind attribute
I(p_i, n_i, r_i) = information gain of a particular distinct value of the attribute

Entropy of Wind = [(1+3+1)/10] * 1.371 + [(3+1+1)/10] * 1.371
Entropy of Wind = 1.371

Gain of Wind = Total Information Gain - Entropy of Wind = 1.522 - 1.371 = 0.151
2. Humidity:
The Humidity attribute has two distinct values, which are High and Normal.
We will find the information gain of these distinct values as follows.

i. High:
p_i = no. of Hot values related to High = 3
n_i = no. of Mild values related to High = 1
r_i = no. of Cool values related to High = 1
s = p_i + n_i + r_i = 3 + 1 + 1 = 5

Therefore,
I(High) = I(p_i, n_i, r_i) = -(3/5)*log2(3/5) - (1/5)*log2(1/5) - (1/5)*log2(1/5)
I(High) = 1.371 .......... using calculator

ii. Normal:
p_i = no. of Hot values related to Normal = 1
n_i = no. of Mild values related to Normal = 3
r_i = no. of Cool values related to Normal = 1
s = p_i + n_i + r_i = 1 + 3 + 1 = 5

Therefore,
I(Normal) = I(p_i, n_i, r_i) = -(1/5)*log2(1/5) - (3/5)*log2(3/5) - (1/5)*log2(1/5)
I(Normal) = 1.371 .......... using calculator

Therefore,

Humidity
Distinct values | p_i (related Hot) | n_i (related Mild) | r_i (related Cool) | I(p_i, n_i, r_i)
High            | 3                 | 1                  | 1                  | 1.371
Normal          | 1                 | 3                  | 1                  | 1.371

Now we will find the Entropy of Humidity as follows:

Entropy of Humidity = Σ [ (p_i + n_i + r_i) / (p + n + r) ] * I(p_i, n_i, r_i)

Here,
p + n + r = total count of Hot, Mild and Cool from the reference attribute = 10
p_i + n_i + r_i = total count of related values from the above table for each distinct value in the Humidity attribute
I(p_i, n_i, r_i) = information gain of a particular distinct value of the attribute

Entropy of Humidity = [(3+1+1)/10] * 1.371 + [(1+3+1)/10] * 1.371
Entropy of Humidity = 1.371

Gain of Humidity = Total Information Gain - Entropy of Humidity = 1.522 - 1.371 = 0.151

Gain of Wind = Gain of Humidity = 0.151
Here both values are the same, so we can take either attribute as the root node.
If they were different, then we would have selected the attribute with the biggest value.
Q5. Create a decision tree for the attribute "Class" using the respective values:

Eye Colour | Married | Sex    | Hair Length | Class
Brown      | Yes     | Male   | Long        | Football
Blue       | Yes     | Male   | Short       | Football
Brown      | Yes     | Male   | Long        | Football
Brown      | No      | Female | Long        | Netball
Brown      | No      | Female | Long        | Netball
Blue       | No      | Male   | Long        | Football
Brown      | No      | Female | Long        | Netball
Brown      | No      | Male   | Short       | Football
Brown      | Yes     | Female | Short       | Netball
Brown      | No      | Female | Long        | Netball
Blue       | No      | Male   | Long        | Football
Blue       | No      | Male   | Short       | Football

Ans: [10M | Dec16]
Finding the total information gain I(p, n) using the Class attribute:
There are two distinct values in Class, which are Football and Netball.
Here, p = total count of Football = 7
n = total count of Netball = 5
s = p + n = 7 + 5 = 12

Therefore,
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s)
        = -(7/12)*log2(7/12) - (5/12)*log2(5/12)
I(p, n) = 0.980 .......... using calculator
Now we will find the Information Gain, Entropy and Gain of the other attributes except the reference attribute.

1. Eye Colour:
The Eye Colour attribute has two distinct values, which are Brown and Blue. We will find the information gain of these distinct values as follows.

i. Brown:
p_i = no. of Football values related to Brown = 3
n_i = no. of Netball values related to Brown = 5
s = p_i + n_i = 3 + 5 = 8

Therefore,
I(Brown) = I(p_i, n_i) = -(3/8)*log2(3/8) - (5/8)*log2(5/8)
I(Brown) = 0.955 .......... using calculator

ii. Blue:
p_i = no. of Football values related to Blue = 4
n_i = no. of Netball values related to Blue = 0
s = p_i + n_i = 4 + 0 = 4

Therefore,
I(Blue) = I(p_i, n_i) = -(4/4)*log2(4/4) - (0/4)*log2(0/4)
I(Blue) = 0 .......... if any one value is 0, then the information gain is 0.

Therefore,

Eye Colour
Distinct values | p_i (related Football) | n_i (related Netball) | I(p_i, n_i)
Brown           | 3                      | 5                     | 0.955
Blue            | 4                      | 0                     | 0

Now we will find the Entropy of Eye Colour as follows:

Entropy of Eye Colour = Σ [ (p_i + n_i) / (p + n) ] * I(p_i, n_i)

Here,
p + n = total count of Football and Netball from the Class attribute = 12
p_i + n_i = total count of related values from the above table for each distinct value in the Eye Colour attribute
I(p_i, n_i) = information gain of a particular distinct value of the attribute

Entropy of Eye Colour = (8/12) * 0.955 + (4/12) * 0
Entropy of Eye Colour = 0.637

Gain of Eye Colour = Total Information Gain - Entropy of Eye Colour = 0.980 - 0.637 = 0.343
2. Married:
The Married attribute has two distinct values, which are Yes and No. We will find the information gain of these distinct values as follows.

i. Yes:
p_i = no. of Football values related to Yes = 3
n_i = no. of Netball values related to Yes = 1
s = p_i + n_i = 3 + 1 = 4

Therefore,
I(Yes) = I(p_i, n_i) = -(3/4)*log2(3/4) - (1/4)*log2(1/4)
I(Yes) = 0.812 .......... using calculator

ii. No:
p_i = no. of Football values related to No = 4
n_i = no. of Netball values related to No = 4
s = p_i + n_i = 4 + 4 = 8

Therefore,
I(No) = I(p_i, n_i) = -(4/8)*log2(4/8) - (4/8)*log2(4/8)
I(No) = 1 .......... as both values are 4, which is the same, the answer will be 1.

Therefore,

Married
Distinct values | p_i (related Football) | n_i (related Netball) | I(p_i, n_i)
Yes             | 3                      | 1                     | 0.812
No              | 4                      | 4                     | 1

Now we will find the Entropy of Married as follows:

Entropy of Married = Σ [ (p_i + n_i) / (p + n) ] * I(p_i, n_i)

Here,
p + n = total count of Football and Netball from the Class attribute = 12
p_i + n_i = total count of related values from the above table for each distinct value in the Married attribute
I(p_i, n_i) = information gain of a particular distinct value of the attribute

Entropy of Married = [(3+1)/12] * 0.812 + [(4+4)/12] * 1
Entropy of Married = 0.938

Gain of Married = Total Information Gain - Entropy of Married = 0.980 - 0.938 = 0.042
3. Sex:
The Sex attribute has two distinct values, which are Male and Female. We will find the information gain of these distinct values as follows.

i. Male:
p_i = no. of Football values related to Male = 7
n_i = no. of Netball values related to Male = 0
s = p_i + n_i = 7 + 0 = 7

Therefore,
I(Male) = I(p_i, n_i) = -(7/7)*log2(7/7) - (0/7)*log2(0/7)
I(Male) = 0 .......... if any one value is 0, then the information gain is 0.

ii. Female:
p_i = no. of Football values related to Female = 0
n_i = no. of Netball values related to Female = 5
s = p_i + n_i = 0 + 5 = 5

Therefore,
I(Female) = I(p_i, n_i) = -(0/5)*log2(0/5) - (5/5)*log2(5/5)
I(Female) = 0 .......... if any one value is 0, then the information gain is 0.

Therefore,

Sex
Distinct values | p_i (related Football) | n_i (related Netball) | I(p_i, n_i)
Male            | 7                      | 0                     | 0
Female          | 0                      | 5                     | 0

Now we will find the Entropy of Sex as follows:

Entropy of Sex = Σ [ (p_i + n_i) / (p + n) ] * I(p_i, n_i)

Here,
p + n = total count of Football and Netball from the Class attribute = 12
p_i + n_i = total count of related values from the above table for each distinct value in the Sex attribute
I(p_i, n_i) = information gain of a particular distinct value of the attribute

Entropy of Sex = [(7+0)/12] * 0 + [(0+5)/12] * 0
Entropy of Sex = 0

Gain of Sex = Total Information Gain - Entropy of Sex = 0.980 - 0 = 0.980
4. Hair Length:
The Hair Length attribute has two distinct values, which are Long and Short. We will find the information gain of these distinct values as follows.

i. Long:
p_i = no. of Football values related to Long = 4
n_i = no. of Netball values related to Long = 4
s = p_i + n_i = 4 + 4 = 8

Therefore,
I(Long) = I(p_i, n_i) = -(4/8)*log2(4/8) - (4/8)*log2(4/8)
I(Long) = 1 .......... as both values are 4, which is the same, the answer will be 1.

ii. Short:
p_i = no. of Football values related to Short = 3
n_i = no. of Netball values related to Short = 1
s = p_i + n_i = 3 + 1 = 4

Therefore,
I(Short) = I(p_i, n_i) = -(3/4)*log2(3/4) - (1/4)*log2(1/4)
I(Short) = 0.812

Therefore,

Hair Length
Distinct values | p_i (related Football) | n_i (related Netball) | I(p_i, n_i)
Long            | 4                      | 4                     | 1
Short           | 3                      | 1                     | 0.812

Now we will find the Entropy of Hair Length as follows:

Entropy of Hair Length = Σ [ (p_i + n_i) / (p + n) ] * I(p_i, n_i)

Here,
p + n = total count of Football and Netball from the Class attribute = 12
p_i + n_i = total count of related values from the above table for each distinct value in the Hair Length attribute
I(p_i, n_i) = information gain of a particular distinct value of the attribute

Entropy of Hair Length = [(4+4)/12] * 1 + [(3+1)/12] * 0.812
Entropy of Hair Length = 0.938

Gain of Hair Length = Total Information Gain - Entropy of Hair Length = 0.980 - 0.938 = 0.042

Gain of all attributes except the Class attribute:
Gain (Eye Colour) = 0.343
Gain (Married) = 0.042
Gain (Sex) = 0.980
Gain (Hair Length) = 0.042

Here, the Sex attribute has the largest value, so the attribute 'Sex' will be the root node of the decision tree.

[Figure: Root node "Sex" with branches Male and Female]
First we will go for the Male value to get its child node.
Now we have to repeat the entire process where Sex = Male.
We will take those tuples from the given data which contain Male as Sex, and construct a table of those tuples.

Eye Colour | Married | Sex  | Hair Length | Class
Brown      | Yes     | Male | Long        | Football
Blue       | Yes     | Male | Short       | Football
Brown      | Yes     | Male | Long        | Football
Blue       | No      | Male | Long        | Football
Brown      | No      | Male | Short       | Football
Blue       | No      | Male | Long        | Football
Blue       | No      | Male | Short       | Football

Here we can see that Football is the only value of Class which is related to the Male value of the Sex attribute.
So we can say that all Males play Football.
Now we will go for the Female value to get its child node.
We will take those tuples from the given data which contain Female as Sex, and construct a table of those tuples.

Eye Colour | Married | Sex    | Hair Length | Class
Brown      | No      | Female | Long        | Netball
Brown      | No      | Female | Long        | Netball
Brown      | No      | Female | Long        | Netball
Brown      | Yes     | Female | Short       | Netball
Brown      | No      | Female | Long        | Netball

Here we can see that Netball is the only value of Class which is related to the Female value of the Sex attribute.
So we can say that all Females play Netball.
So the final decision tree will be as follows:

[Decision tree: root node Sex; the Male branch leads to the leaf "Football" and the Female branch leads to the leaf "Netball"]
Q6. For the given data, determine the entropy after classification using each attribute for classification separately and find which attribute is best as the decision attribute for the root by finding information gain with respect to entropy of Temperature as the reference attribute.

Sr.No | Temperature | Wind   | Humidity
1     | Hot         | Weak   | Normal
2     | Hot         | Strong | High
3     | Mild        | Weak   | Normal
4     | Mild        | Strong | High
5     | Cool        | Weak   | Normal
6     | Mild        | Strong | Normal
7     | Mild        | Weak   | High
8     | Hot         | Strong | Normal
9     | Mild        | Strong | Normal
10    | Cool        | Strong | Normal

Ans:
First we have to find the entropy of all attributes.

1. Temperature:
There are three distinct values in Temperature, which are Hot, Mild and Cool.
As there are three distinct values in the reference attribute, the total information gain will be I(p, n, r).
Here, p = total count of Hot = 3
n = total count of Mild = 5
r = total count of Cool = 2
s = p + n + r = 3 + 5 + 2 = 10

Therefore,
I(p, n, r) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) - (r/s)*log2(r/s)
           = -(3/10)*log2(3/10) - (5/10)*log2(5/10) - (2/10)*log2(2/10)
I(p, n, r) = 1.486 .......... using calculator

2. Wind:
There are two distinct values in Wind, which are Strong and Weak.
As there are two distinct values in the reference attribute, the total information gain will be I(p, n).
Here, p = total count of Strong = 6
n = total count of Weak = 4
s = p + n = 6 + 4 = 10

Therefore,
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s)
        = -(6/10)*log2(6/10) - (4/10)*log2(4/10)
I(p, n) = 0.971 .......... using calculator

3. Humidity:
There are two distinct values in Humidity, which are High and Normal.
As there are two distinct values in the reference attribute, the total information gain will be I(p, n).
Here, p = total count of High = 3
n = total count of Normal = 7
s = p + n = 3 + 7 = 10

Therefore,
I(p, n) = -(p/s)*log2(p/s) - (n/s)*log2(n/s)
        = -(3/10)*log2(3/10) - (7/10)*log2(7/10)
I(p, n) = 0.882 .......... using calculator
Now we will find the best root node using Temperature as the reference attribute.
Here, the reference attribute is Temperature.
There are three distinct values in Temperature, which are Hot, Mild and Cool.
As there are three distinct values in the reference attribute, the total information gain will be I(p, n, r).

Here, p = total count of Hot = 3
n = total count of Mild = 5
r = total count of Cool = 2
s = p + n + r = 3 + 5 + 2 = 10

Therefore,
I(p, n, r) = -(p/s)*log2(p/s) - (n/s)*log2(n/s) - (r/s)*log2(r/s)
           = -(3/10)*log2(3/10) - (5/10)*log2(5/10) - (2/10)*log2(2/10)
I(p, n, r) = 1.486 .......... using calculator
Now we will find the Information Gain, Entropy and Gain of the other attributes (except the reference attribute).

1. Wind:
The Wind attribute has two distinct values, which are Weak and Strong.
We will find the information gain of these distinct values as follows.

I. Weak:
p1 = no. of Hot values related to Weak = 1
n1 = no. of Mild values related to Weak = 2
r1 = no. of Cool values related to Weak = 1
s = p1 + n1 + r1 = 1 + 2 + 1 = 4
Therefore,
I(Weak) = I(p1, n1, r1) = -(1/4) log2(1/4) - (2/4) log2(2/4) - (1/4) log2(1/4)
I(Weak) = 1.5 .......... using calculator

II. Strong:
p1 = no. of Hot values related to Strong = 2
n1 = no. of Mild values related to Strong = 3
r1 = no. of Cool values related to Strong = 1
s = p1 + n1 + r1 = 2 + 3 + 1 = 6
Therefore,
I(Strong) = I(p1, n1, r1) = -(2/6) log2(2/6) - (3/6) log2(3/6) - (1/6) log2(1/6)
I(Strong) = 1.460 .......... using calculator
Therefore,

Wind
Distinct values | p1 (related values of Hot) | n1 (related values of Mild) | r1 (related values of Cool) | Information Gain of value I(p1, n1, r1)
Weak            | 1                          | 2                           | 1                           | 1.5
Strong          | 2                          | 3                           | 1                           | 1.460
Now we will find the Entropy of Wind as follows.

Entropy of Wind = Σ [(p1 + n1 + r1) / (p + n + r)] × I(p1, n1, r1)

Here,
p + n + r = total count of Hot, Mild and Cool from the reference attribute = 10
p1 + n1 + r1 = total count of related values from the above table for each distinct value of the Wind attribute
I(p1, n1, r1) = information gain of the particular distinct value of the attribute

Entropy of Wind = [(1 + 2 + 1) / 10] × 1.5 + [(2 + 3 + 1) / 10] × 1.460
Entropy of Wind = 1.476

Gain of Wind = Entropy of Reference - Entropy of Wind = 1.486 - 1.476 = 0.01
2. Humidity:
The Humidity attribute has two distinct values, which are High and Normal.
We will find the information gain of these distinct values as follows.

I. High:
p1 = no. of Hot values related to High = 1
n1 = no. of Mild values related to High = 2
r1 = no. of Cool values related to High = 0
s = p1 + n1 + r1 = 1 + 2 + 0 = 3
Therefore,
I(High) = I(p1, n1, r1) = -(1/3) log2(1/3) - (2/3) log2(2/3) - (0/3) log2(0/3)
I(High) = 0.919 .......... using calculator

II. Normal:
p1 = no. of Hot values related to Normal = 2
n1 = no. of Mild values related to Normal = 3
r1 = no. of Cool values related to Normal = 2
s = p1 + n1 + r1 = 2 + 3 + 2 = 7
Therefore,
I(Normal) = I(p1, n1, r1) = -(2/7) log2(2/7) - (3/7) log2(3/7) - (2/7) log2(2/7)
I(Normal) = 1.557 .......... using calculator
Therefore,

Humidity
Distinct values | p1 (related values of Hot) | n1 (related values of Mild) | r1 (related values of Cool) | Information Gain of value I(p1, n1, r1)
High            | 1                          | 2                           | 0                           | 0.919
Normal          | 2                          | 3                           | 2                           | 1.557
Now we will find the Entropy of Humidity as follows.

Entropy of Humidity = Σ [(p1 + n1 + r1) / (p + n + r)] × I(p1, n1, r1)

Here,
p + n + r = total count of Hot, Mild and Cool from the reference attribute = 10
p1 + n1 + r1 = total count of related values from the above table for each distinct value of the Humidity attribute
I(p1, n1, r1) = information gain of the particular distinct value of the attribute

Entropy of Humidity = [(1 + 2 + 0) / 10] × 0.919 + [(2 + 3 + 2) / 10] × 1.557
Entropy of Humidity = 1.366

Gain of Humidity = Entropy of Reference - Entropy of Humidity = 1.486 - 1.366 = 0.12

Gain of Wind = 0.01
Gain of Humidity = 0.12
Here the value of Gain(Humidity) is the largest, so we will take the Humidity attribute as the root node.
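The arithmetic above can be cross-checked with a short script. The sketch below is not part of the original solution (the function names are illustrative); it recomputes I(p, n, r) and the gains for Wind and Humidity on this dataset.

import math
from collections import Counter

def info(counts):
    """I(p, n, ...) for a list of class counts, ignoring zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Each row: (Temperature, Wind, Humidity); Temperature is the reference attribute.
rows = [("Hot","Weak","Normal"), ("Hot","Strong","High"), ("Mild","Weak","Normal"),
        ("Mild","Strong","High"), ("Cool","Weak","Normal"), ("Mild","Strong","Normal"),
        ("Mild","Weak","High"), ("Hot","Strong","Normal"), ("Mild","Strong","Normal"),
        ("Cool","Strong","Normal")]

ref_entropy = info(list(Counter(r[0] for r in rows).values()))   # 1.486

def gain(attr_index):
    # Entropy of the attribute = weighted sum of I() over its distinct values.
    entropy = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == value]
        counts = list(Counter(r[0] for r in subset).values())
        entropy += (len(subset) / len(rows)) * info(counts)
    return ref_entropy - entropy

print(round(gain(1), 2))   # Wind     -> 0.01
print(round(gain(2), 2))   # Humidity -> 0.12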
Q7. For the SunBurn dataset given below, construct a decision tree.

Name   | Hair   | Height  | Weight  | Location | Class
Sunita | Blonde | Average | Light   | No       | Yes
Anita  | Blonde | Tall    | Average | Yes      | No
Kavita | Brown  | Short   | Average | Yes      | No
Sushma | Blonde | Short   | Average | No       | Yes
Xavier | Red    | Average | Heavy   | No       | Yes
Balaji | Brown  | Tall    | Heavy   | No       | No
Ramesh | Brown  | Average | Heavy   | No       | No
Swetha | Blonde | Short   | Light   | Yes      | No

[10M - May17 & May18]
Ans:
First we will find the entropy of the Class attribute as follows.
There are two distinct values in Class, which are Yes and No.
Here, p = total count of Yes = 3
n = total count of No = 5
s = p + n = 3 + 5 = 8
Therefore,
I(p, n) = -(p/s) log2(p/s) - (n/s) log2(n/s)
        = -(3/8) log2(3/8) - (5/8) log2(5/8)
I(p, n) = 0.955 .......... using calculator
Now we will find the Information Gain, Entropy and Gain of the other attributes.

1. Hair:
The Hair attribute has three distinct values, which are Blonde, Brown and Red.
We will find the information gain of these distinct values as follows.

I. Blonde:
p1 = no. of Yes values related to Blonde = 2
n1 = no. of No values related to Blonde = 2
s = p1 + n1 = 2 + 2 = 4
Therefore,
I(Blonde) = I(p1, n1) = -(2/4) log2(2/4) - (2/4) log2(2/4)
I(Blonde) = 1 .......... as the values of p1 and n1 are the same, the answer will be 1

II. Brown:
p1 = no. of Yes values related to Brown = 0
n1 = no. of No values related to Brown = 3
s = p1 + n1 = 0 + 3 = 3
Therefore,
I(Brown) = I(p1, n1) = -(0/3) log2(0/3) - (3/3) log2(3/3)
I(Brown) = 0 .......... if any one value is 0, the information gain of that value is 0

III. Red:
p1 = no. of Yes values related to Red = 1
n1 = no. of No values related to Red = 0
s = p1 + n1 = 1 + 0 = 1
Therefore,
I(Red) = I(p1, n1) = -(1/1) log2(1/1) - (0/1) log2(0/1)
I(Red) = 0 .......... if any one value is 0, the information gain of that value is 0
Therefore,

Hair
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Blonde          | 2                          | 2                         | 1
Brown           | 0                          | 3                         | 0
Red             | 1                          | 0                         | 0
Now we will find the Entropy of Hair as follows.

Entropy of Hair = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 8
p1 + n1 = total count of related values from the above table for each distinct value of the Hair attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Hair = [(2 + 2) / 8] × 1 + [(0 + 3) / 8] × 0 + [(1 + 0) / 8] × 0
Entropy of Hair = 0.5

Gain of Hair = Entropy of Class - Entropy of Hair = 0.955 - 0.5 = 0.455
2.
Height:
The Height attribute has three distinct values, which are Tall, Average and Short.
We will find the information gain of these distinct values as follows.

I. Tall:
p1 = no. of Yes values related to Tall = 0
n1 = no. of No values related to Tall = 2
s = p1 + n1 = 0 + 2 = 2
Therefore,
I(Tall) = I(p1, n1) = -(0/2) log2(0/2) - (2/2) log2(2/2)
I(Tall) = 0 .......... if any value from p1 and n1 is 0, the answer will be 0

II. Average:
p1 = no. of Yes values related to Average = 2
n1 = no. of No values related to Average = 1
s = p1 + n1 = 2 + 1 = 3
Therefore,
I(Average) = I(p1, n1) = -(2/3) log2(2/3) - (1/3) log2(1/3)
I(Average) = 0.919 .......... using calculator
III. Short:
p1 = no. of Yes values related to Short = 1
n1 = no. of No values related to Short = 2
s = p1 + n1 = 1 + 2 = 3
Therefore,
I(Short) = I(p1, n1) = -(1/3) log2(1/3) - (2/3) log2(2/3)
I(Short) = 0.919 .......... using calculator
Therefore,

Height
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Tall            | 0                          | 2                         | 0
Average         | 2                          | 1                         | 0.919
Short           | 1                          | 2                         | 0.919
Now we will find the Entropy of Height as follows.

Entropy of Height = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 8
p1 + n1 = total count of related values from the above table for each distinct value of the Height attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Height = [(0 + 2) / 8] × 0 + [(2 + 1) / 8] × 0.919 + [(1 + 2) / 8] × 0.919
Entropy of Height = 0.690

Gain of Height = Entropy of Class - Entropy of Height = 0.955 - 0.690 = 0.265
3. Weight:
The Weight attribute has three distinct values, which are Heavy, Average and Light.
We will find the information gain of these distinct values as follows.

I. Heavy:
p1 = no. of Yes values related to Heavy = 1
n1 = no. of No values related to Heavy = 2
s = p1 + n1 = 1 + 2 = 3
Therefore,
I(Heavy) = I(p1, n1) = -(1/3) log2(1/3) - (2/3) log2(2/3)
I(Heavy) = 0.919 .......... using calculator
I.
Average:
p1 = no. of Yes values related to Average = 1
n1 = no. of No values related to Average = 2
s = p1 + n1 = 1 + 2 = 3
Therefore,
I(Average) = I(p1, n1) = -(1/3) log2(1/3) - (2/3) log2(2/3)
I(Average) = 0.919 .......... using calculator

II. Light:
p1 = no. of Yes values related to Light = 1
n1 = no. of No values related to Light = 1
s = p1 + n1 = 1 + 1 = 2
Therefore,
I(Light) = I(p1, n1) = -(1/2) log2(1/2) - (1/2) log2(1/2)
I(Light) = 1 .......... as the values of p1 and n1 are the same, the answer will be 1
Therefore,

Weight
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Heavy           | 1                          | 2                         | 0.919
Average         | 1                          | 2                         | 0.919
Light           | 1                          | 1                         | 1
Now we will find the Entropy of Weight as follows.

Entropy of Weight = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 8
p1 + n1 = total count of related values from the above table for each distinct value of the Weight attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Weight = [(1 + 2) / 8] × 0.919 + [(1 + 2) / 8] × 0.919 + [(1 + 1) / 8] × 1
Entropy of Weight = 0.94

Gain of Weight = Entropy of Class - Entropy of Weight = 0.955 - 0.94 = 0.015
4. Location:
The Location attribute has two distinct values, which are Yes and No.
We will find the information gain of these distinct values as follows.

I. Yes:
p1 = no. of Yes (class) values related to Location = Yes: 0
n1 = no. of No (class) values related to Location = Yes: 3
s = p1 + n1 = 0 + 3 = 3
Therefore,
I(Yes) = I(p1, n1) = -(0/3) log2(0/3) - (3/3) log2(3/3)
I(Yes) = 0 .......... if any one value from p1 and n1 is 0, the answer will be 0

II. No:
p1 = no. of Yes (class) values related to Location = No: 3
n1 = no. of No (class) values related to Location = No: 2
s = p1 + n1 = 3 + 2 = 5
Therefore,
I(No) = I(p1, n1) = -(3/5) log2(3/5) - (2/5) log2(2/5)
I(No) = 0.971 .......... using calculator
Therefore,

Location
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Yes             | 0                          | 3                         | 0
No              | 3                          | 2                         | 0.971
Now we will find the Entropy of Location as follows.

Entropy of Location = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 8
p1 + n1 = total count of related values from the above table for each distinct value of the Location attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Location = [(0 + 3) / 8] × 0 + [(3 + 2) / 8] × 0.971
Entropy of Location = 0.607

Gain of Location = Entropy of Class - Entropy of Location = 0.955 - 0.607 = 0.348
Here,
Gain (Hair) = 0.455
Gain (Height) = 0.265
Gain (Weight) = 0.015
Gain (Location) = 0.348

Here we can see that the Gain of the Hair attribute is the highest, so we will take the Hair attribute as the root node.
Now we will construct a table for each distinct value of the Hair attribute as follows.
Blonde:
Name   | Height  | Weight  | Location | Class
Sunita | Average | Light   | No       | Yes
Anita  | Tall    | Average | Yes      | No
Sushma | Short   | Average | No       | Yes
Swetha | Short   | Light   | Yes      | No

Brown:
Name   | Height  | Weight  | Location | Class
Kavita | Short   | Average | Yes      | No
Balaji | Tall    | Heavy   | No       | No
Ramesh | Average | Heavy   | No       | No

Red:
Name   | Height  | Weight  | Location | Class
Xavier | Average | Heavy   | No       | Yes

As we can see, for the table of Brown values the class value is No for all data tuples, and similarly for the table of Red values the class value is Yes for all data tuples.
So we will expand the node of the Blonde value in the decision tree.
We will now use the following table, which we have constructed above for the Blonde value.
Name   | Height  | Weight  | Location | Class
Sunita | Average | Light   | No       | Yes
Anita  | Tall    | Average | Yes      | No
Sushma | Short   | Average | No       | Yes
Swetha | Short   | Light   | Yes      | No
Now we will find the entropy of the Class attribute again (for the Blonde table) as follows.
There are two distinct values in Class, which are Yes and No.
Here, p = total count of Yes = 2
n = total count of No = 2
s = p + n = 2 + 2 = 4
Therefore,
I(p, n) = -(2/4) log2(2/4) - (2/4) log2(2/4)
I(p, n) = 1 .......... as the values of p and n are the same, the answer will be 1
Now we will find the Information Gain, Entropy and Gain of the other attributes.

1. Height:
The Height attribute has three distinct values, which are Tall, Average and Short.
We will find the information gain of these distinct values as follows.

I. Tall:
p1 = no. of Yes values related to Tall = 0
n1 = no. of No values related to Tall = 1
s = p1 + n1 = 0 + 1 = 1
Therefore,
I(Tall) = 0 .......... if any value from p1 and n1 is 0, the answer will be 0

II. Average:
p1 = no. of Yes values related to Average = 1
n1 = no. of No values related to Average = 0
s = p1 + n1 = 1 + 0 = 1
Therefore,
I(Average) = 0 .......... if any value from p1 and n1 is 0, the answer will be 0

III. Short:
p1 = no. of Yes values related to Short = 1
n1 = no. of No values related to Short = 1
s = p1 + n1 = 1 + 1 = 2
Therefore,
I(Short) = I(p1, n1) = -(1/2) log2(1/2) - (1/2) log2(1/2)
I(Short) = 1 .......... as the values of p1 and n1 are the same, the answer will be 1

Therefore,

Height
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Tall            | 0                          | 1                         | 0
Average         | 1                          | 0                         | 0
Short           | 1                          | 1                         | 1
Now we will find the Entropy of Height as follows.

Entropy of Height = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 4
p1 + n1 = total count of related values from the above table for each distinct value of the Height attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Height = [(0 + 1) / 4] × 0 + [(1 + 0) / 4] × 0 + [(1 + 1) / 4] × 1
Entropy of Height = 0.5

Gain of Height = Entropy of Class - Entropy of Height = 1 - 0.5 = 0.5
2.
Weight:
The Weight attribute has three distinct values, which are Heavy, Average and Light.
We will find the information gain of these distinct values as follows.

I. Heavy:
p1 = no. of Yes values related to Heavy = 0
n1 = no. of No values related to Heavy = 0
s = p1 + n1 = 0 + 0 = 0
Therefore,
I(Heavy) = 0 .......... there are no Heavy tuples in the Blonde table
_
II. Average:
p1 = no. of Yes values related to Average = 1
n1 = no. of No values related to Average = 1
s = p1 + n1 = 1 + 1 = 2
Therefore,
I(Average) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1 .......... as the values of p1 and n1 are the same

III. Light:
p1 = no. of Yes values related to Light = 1
n1 = no. of No values related to Light = 1
s = p1 + n1 = 1 + 1 = 2
Therefore,
I(Light) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Therefore,

Weight
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Heavy           | 0                          | 0                         | 0
Average         | 1                          | 1                         | 1
Light           | 1                          | 1                         | 1
Now we will find the Entropy of Weight as follows.

Entropy of Weight = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 4
p1 + n1 = total count of related values from the above table for each distinct value of the Weight attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Weight = [(0 + 0) / 4] × 0 + [(1 + 1) / 4] × 1 + [(1 + 1) / 4] × 1
Entropy of Weight = 1

Gain of Weight = Entropy of Class - Entropy of Weight = 1 - 1 = 0
3. Location:
The Location attribute has two distinct values, which are Yes and No.
We will find the information gain of these distinct values as follows.

I. Yes:
p1 = no. of Yes (class) values related to Location = Yes: 0
n1 = no. of No (class) values related to Location = Yes: 2
s = p1 + n1 = 0 + 2 = 2
Therefore,
I(Yes) = -(0/2) log2(0/2) - (2/2) log2(2/2)
I(Yes) = 0 .......... if any one value from p1 and n1 is 0, the answer will be 0

II. No:
p1 = no. of Yes (class) values related to Location = No: 2
n1 = no. of No (class) values related to Location = No: 0
s = p1 + n1 = 2 + 0 = 2
Therefore,
I(No) = -(2/2) log2(2/2) - (0/2) log2(0/2)
I(No) = 0 .......... if any one value from p1 and n1 is 0, the answer will be 0

Therefore,

Location
Distinct values | p1 (related values of Yes) | n1 (related values of No) | Information Gain of value I(p1, n1)
Yes             | 0                          | 2                         | 0
No              | 2                          | 0                         | 0
Now we will find the Entropy of Location as follows.

Entropy of Location = Σ [(p1 + n1) / (p + n)] × I(p1, n1)

Here,
p + n = total count of Yes and No from the Class attribute = 4
p1 + n1 = total count of related values from the above table for each distinct value of the Location attribute
I(p1, n1) = information gain of the particular distinct value of the attribute

Entropy of Location = [(0 + 2) / 4] × 0 + [(2 + 0) / 4] × 0
Entropy of Location = 0

Gain of Location = Entropy of Class - Entropy of Location = 1 - 0 = 1

Here,
Gain (Height) = 0.5
Gain (Weight) = 0
Gain (Location) = 1

As the Gain of Location is the largest value, we will take the Location attribute as the splitting node.
Now we will construct a table for each distinct value of the Location attribute as follows.
Yes:
Name   | Height | Weight  | Class
Anita  | Tall   | Average | No
Swetha | Short  | Light   | No

No:
Name   | Height  | Weight  | Class
Sunita | Average | Light   | Yes
Sushma | Short   | Average | Yes

As we can see, the class value for Location = Yes is No for all tuples, and for Location = No it is Yes for all tuples. There is no need to do further classification.

The final decision tree will be as follows.
                    Hair
         +-----------+-----------+
         |           |           |
      Blonde       Brown        Red
         |           |           |
     Location      ( No )      ( Yes )
      +-----+
      |     |
     Yes    No
      |     |
    ( No ) ( Yes )
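As a rough cross-check of the construction above, the sketch below (not part of the original text) recomputes the root-level information gains for the SunBurn data with an ID3-style gain function; the variable names are my own.

import math
from collections import Counter

data = [  # (Hair, Height, Weight, Location, Class)
    ("Blonde","Average","Light","No","Yes"), ("Blonde","Tall","Average","Yes","No"),
    ("Brown","Short","Average","Yes","No"),  ("Blonde","Short","Average","No","Yes"),
    ("Red","Average","Heavy","No","Yes"),    ("Brown","Tall","Heavy","No","No"),
    ("Brown","Average","Heavy","No","No"),   ("Blonde","Short","Light","Yes","No"),
]
attrs = ["Hair", "Height", "Weight", "Location"]

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    base = info([r[-1] for r in rows])
    rem = 0.0
    for v in set(r[i] for r in rows):
        part = [r[-1] for r in rows if r[i] == v]
        rem += (len(part) / len(rows)) * info(part)
    return base - rem

for i, a in enumerate(attrs):
    print(a, round(gain(data, i), 3))
# Hair has the largest gain (about 0.45), so it becomes the root,
# matching the hand calculation above.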
CHAP - 4: SUPPORT VECTOR MACHINES
Q1. What are the key terminologies of Support Vector Machine?
Q2. What is SVM? Explain the following terms: hyperplane, separating hyperplane, margin and support vectors with suitable example.
Q3. Explain the key terminologies of Support Vector Machine.
Ans:
[5M | May16, Dec16 & May17]

SUPPORT VECTOR MACHINE:
1.
A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.
HYPERPLANE:
1.
A hyperplane is a generalization of a plane.
2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups.
3. Figure 4.1 shows an example of a hyperplane.
[Figure 4.1: Example of a hyperplane]
4. As a simple example, for a classification task with only two features as shown in figure 4.1, you can think of a hyperplane as a line that linearly separates and classifies a set of data.
5. When new testing data is added, whichever side of the hyperplane it lands on will decide the class that we assign to it.
SEPARATING
HYPERPLANE:
1.
From figure 4.1, we can see that it is possible to separate the data.
2. We can use a line to separate the data.
3. All the data points representing men will be above the line.
4. All the data points representing women will be below the line.
5. Such a line is called a separating hyperplane.

MARGIN:
1. A margin is the separation of the line to the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both the classes.
4. A good margin allows the points to be in their respective classes without crossing to the other class.
5. The more width the margin has, the more optimal the hyperplane we get.
SUPPORT VECTORS:
1. The vectors (cases) that define the hyperplane are the support vectors.
2. Vectors are separated using the hyperplane.
3. Vectors are mostly from a group which is classified using the hyperplane.
4. Figure 4.2 shows an example of support vectors.
[Figure 4.2: Example of support vectors, showing the margin width]
Q4. Define Support Vector Machine (SVM) and further explain the maximum margin linear separators concept.
[10M | Dec17]
SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

MAXIMAL-MARGIN CLASSIFIER/SEPARATOR:
1. The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice.
2. The numeric input variables (x) in your data (the columns) form an n-dimensional space.
3. For example, if you had two input variables, this would form a two-dimensional space.
4. A hyperplane is a line that splits the input variable space.
5. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
6. In two dimensions you can visualize this as a line, and let us assume that all of our input points can be completely separated by this line.
7. For example: B0 + (B1 × X1) + (B2 × X2) = 0
8. Here the coefficients (B1 and B2) that determine the slope of the line and the intercept (B0) are found by the learning algorithm, and X1 and X2 are the two input variables.
9. You can make classifications using this line.
10. By plugging input values into the line equation, you can calculate whether a new point is above or below the line.
11. Above the line, the equation returns a value greater than 0 and the point belongs to the first class.
12. Below the line, the equation returns a value less than 0 and the point belongs to the second class.
13. A value close to the line returns a value close to zero and the point may be difficult to classify.
14. If the magnitude of the value is large, the model may have more confidence in the prediction.
15. The distance between the line and the closest data points is referred to as the margin.
16. The best or optimal line that can separate the two classes is the line that has the largest margin.
17. This is called the Maximal-Margin hyperplane.
18. The margin is calculated as the perpendicular distance from the line to only the closest points.
19. Only these points are relevant in defining the line and in the construction of the classifier.
20. These points are called the support vectors.
21. They support or define the hyperplane.
22. The hyperplane is learned from training data using an optimization procedure that maximizes the margin.
[Figure 4.4]
Qs.
Q6.
oN
What
ENE
Maximum
rnargin linear separators concept
ae
OLEATE
is Support Vector Machine
(SVM)? How
to compute
the margin?
Whiat is the goal of the Support Vector Machine (SVM)? How to compute the margin?
Ans:
[1OM | May?7 & Mayl8]
SUPPORT VECTOR MACHINE:
Refer Q4 (SVM Part)
MARGIN:
1. A margin is the separation of the line to the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both the classes.
4. A good margin allows the points to be in their respective classes without crossing to the other class.
5. The more width the margin has, the more optimal the hyperplane we get.
EXAMPLE OF HOW TO FIND THE MARGIN:
Consider building an SVM over the (very little) data set shown in figure 4.4.
[Figure 4.4]
1. The maximum margin weight vector will be parallel to the shortest line connecting points of the two classes.
2. So the SVM decision boundary is: y = x1 + 2·x2 - 5.5
3. The optimal decision surface is orthogonal to that line and intersects it at the halfway point.
4. Working algebraically, with the standard constraint that y_i((w · x_i) + b) ≥ 1, we seek to minimize ||w||.
5. This happens when this constraint is satisfied with equality by the two support vectors.
6. Further, we know that the solution is w = (a, 2a) for some a.
7. Therefore a = 2/5 and b = -11/5.
8. So the optimal hyperplane is given by w = (2/5, 4/5) and b = -11/5.
9. The margin is: 2/||w|| = 2/√(4/25 + 16/25) = 2/(2√5/5) = √5
10. The answer can be confirmed geometrically by examining figure 4.4.
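A quick numeric check of the margin formula, assuming the two support vectors of this example are the points (1, 1) and (2, 3) (the figure itself is not reproduced here, so those coordinates are an assumption):

import math

# Hyperplane from the worked example above: w = (2/5, 4/5), b = -11/5.
w = (2/5, 4/5)
b = -11/5

def signed_distance(point):
    """Signed distance of a point from the hyperplane w.x + b = 0."""
    norm = math.hypot(*w)
    return (w[0] * point[0] + w[1] * point[1] + b) / norm

# Assumed support vectors lying on the two margin boundaries.
print(abs(signed_distance((1, 1))))   # about 1.118 = sqrt(5) / 2
print(abs(signed_distance((2, 3))))   # about 1.118 = sqrt(5) / 2
print(2 / math.hypot(*w))             # full margin width = sqrt(5), about 2.236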
Q7. Write short note on - Soft margin SVM.
Ans:
[10M | Dec18]
SOFT MARGIN SVM:
1. Soft margin is an extended version of the hard margin SVM.
2. The hard margin was given by Boser et al. (1992, COLT) and the soft margin by Vapnik et al. (1995).
3. Hard margin SVM can work only when the data is completely linearly separable without any errors (noise or outliers).
4. In case of errors, either the margin is smaller or the hard margin SVM fails.
5. On the other hand, the soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables.
6. As far as their usage is concerned, since soft margin is an extended version of hard margin SVM, we mostly use soft margin SVM.
7. The allowance of softness in margins (i.e. a low cost setting) allows for errors to be made while fitting the model (support vectors) to the training/discovery data set.
8. Conversely, hard margins will result in fitting of a model that allows zero errors.
9. Sometimes it can be helpful to allow for errors in the training set.
10. It may produce a more generalizable model when applied to new datasets.
11. Forcing rigid margins can result in a model that performs perfectly in the training set, but is possibly over-fit / less generalizable when applied to a new dataset.
12. Identifying the best setting for 'cost' is probably related to the specific data set you are working with.
13. Currently, there aren't many good solutions for simultaneously optimizing cost, features, and kernel parameters (if using a non-linear kernel).
14. In both the soft margin and the hard margin case we are maximizing the margin between support vectors, i.e. minimizing ||w||.
15. In the soft margin case, we let our model give some relaxation to a few points.
16. If we consider these points as support vectors, our margin might reduce significantly and our decision boundary will be poorer.
17. So instead of considering them as support vectors, we consider them as error points.
18. And we give a certain penalty for them which is proportional to the amount by which each data point is violating the hard constraint.
19.
Slack vanables e, can be added to allow misclassification of difficult or noisy examples.
20.
This variables represent the deviation of the examples from the margin.
21,
Doing this we are relaxing the margin, we are using a soft margin.
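As an aside (not from the original text), the 'cost' setting discussed above is exposed in scikit-learn as the C parameter of SVC: a small C gives a softer margin with more slack allowed, while a large C penalizes margin violations more heavily. The data below is made up for illustration.

from sklearn.svm import SVC
import numpy as np

# Two slightly overlapping classes (the last point is deliberately noisy).
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2],       # class 0
              [4, 4], [5, 4], [4, 5], [1.5, 1.5]])  # class 1 (last point is noise)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for C in (0.1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates the noisy point; a large C penalizes slack more heavily.
    print(C, clf.n_support_, clf.score(X, y))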
Q8.
What is Kernel? How kernel can be used with SVM
to classify non-linearly separable data
Also, list standard kernel functions.
Ans:
T1OM | May!é}
KERNEL:
1. A kernel is a similarity function.
2. SVM algorithms use a set of mathematical functions that are defined as the kernel.
3. The function of the kernel is to take data as input and transform it into the required form.
4. It is a function that you provide to a machine learning algorithm.
5. It takes two inputs and spits out how similar they are.
6. Different SVM algorithms use different types of kernel functions.
7. For example: linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
EXAMPLE EXPLAINING HOW A KERNEL CAN BE USED FOR CLASSIFYING NON-LINEARLY SEPARABLE DATA:
1. To predict if a dog is a particular breed, we load in millions of dog information/properties like type, height, skin colour, body hair length etc.
2. In ML language, these properties are referred to as 'features'.
3. A single entry of the list of features is a data instance, while the collection of everything is the Training Data, which forms the basis of your prediction.
4. i.e. if you know the skin colour, body hair length, height and so on of a particular dog, then you can predict the breed it will probably belong to.
5. In support vector machines, it looks somewhat like figure 4.5, which separates the blue balls from the red.
[Figure 4.5]
6. Therefore the hyperplane of the two dimensional space below is a one dimensional line dividing the red and blue dots.
7. From the example above of trying to predict the breed of a particular dog, it goes like this:
   Data (all breeds of dog) » Features (skin colour, hair etc.) » Learning algorithm
8. If we want to solve the following example in a linear manner, then it is not possible to separate it by a straight line as we did in the above steps.
[Figure 4.6]
9. The red and blue balls cannot be separated by a straight line as they are randomly distributed.
10. Here the kernel comes into the picture.
11. In machine learning, a "kernel" is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem.
12. It entails transforming linearly inseparable data (figure 4.6) to linearly separable data (figure 4.5).
13. The kernel function is what is applied on each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
14. Using the dog breed prediction example again, kernels offer a better alternative.
15. Instead of defining a slew of features, you define a single kernel function to compute similarity between breeds of dog.
16. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
17. Mathematical definition: K(x, y) = <f(x), f(y)>.
18. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from the n-dimensional to the m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.
19. Intuition: normally calculating <f(x), f(y)> requires us to calculate f(x) and f(y) first, and then do the dot product.
20. These two computation steps can be quite expensive as they involve manipulations in the m-dimensional space, where m can be a large number.
21. But after all the trouble of going to the high dimensional space, the result of the dot product is a real number, a scalar.
22. Therefore we come back to one-dimensional space again.
23. Now, the question we have is: do we really need to go through all the trouble to get this one number?
24. Do we really have to go to the m-dimensional space?
25. The answer is no, if you find a clever kernel.
26. Simple example: x = (x1, x2, x3), y = (y1, y2, y3).
    For the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y) = (<x, y>)^2.
27. Let's plug in some numbers to make this more intuitive.
28. Suppose x = (1, 2, 3) and y = (4, 5, 6). Then:
    f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
    f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
    <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
29. Now let us use the kernel instead:
    K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024
30. Same result, but this calculation is so much easier.
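The same numeric example can be run directly; this small sketch (mine, not the book's) compares the explicit 9-dimensional feature map with the kernel shortcut:

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Explicit feature map: all pairwise products (9-dimensional space).
f = lambda v: np.outer(v, v).ravel()
explicit = f(x) @ f(y)            # dot product in the high-dimensional space

# Kernel trick: square of the dot product in the original 3-dimensional space.
kernel = (x @ y) ** 2

print(explicit, kernel)           # both print 1024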
Q9. Quadratic Programming solution for finding maximum margin separation in Support Vector Machine.
[10M - May16]
Ans:
1. The linear programming model is a very powerful tool for the analysis of a wide variety of problems in the sciences, industry, engineering, and business.
2. However, it does have its limits. Not all phenomena are linear.
3. Once nonlinearities enter the picture, an LP model is at best only a first-order approximation.
4. The next level of complexity beyond linear programming is quadratic programming.
5. This model allows us to include nonlinearities of a quadratic nature into the objective function.
6. As we shall see, this will be a useful tool for including uncertainty in the selection of optimal portfolios, as in the Markowitz mean-variance models.
7. A quadratic program (QP) is an optimization problem wherein one either minimizes or maximizes a quadratic objective function of a finite number of decision variables subject to a finite number of linear inequality and/or equality constraints.
8. A quadratic function of a finite number of variables x = (x1, x2, ..., xn)^T is any function of the form:
   f(x) = a + Σ_j c_j x_j + (1/2) Σ_j Σ_k x_j Q_jk x_k
9. Using matrix notation, this expression simplifies to:
   f(x) = a + c^T x + (1/2) x^T Q x
10. Here x ∈ R^n, c ∈ R^n and Q ∈ R^(n×n).
11. The factor of one half preceding the quadratic term in the function is included for the sake of convenience, since it simplifies the expressions for the first and second derivatives of f.
12. With no loss in generality, we may as well assume that the matrix Q is symmetric, since
    (1/2) x^T Q x = (1/2) (x^T Q x)^T = (1/2) x^T Q^T x = x^T [(1/2)(Q + Q^T)] x,
    and so we are free to replace the matrix Q by the symmetric matrix (1/2)(Q + Q^T).
13. Henceforth, we will assume that the matrix Q is symmetric.
14. Just as in the case of linear programming, the QP standard form that we use is:
    minimize  f(x) = c^T x + (1/2) x^T Q x
    subject to  A x ≤ b,  x ≥ 0,  where A ∈ R^(m×n) and b ∈ R^m.
15. Every quadratic program can be converted to this standard form.
16. Observe that we have simplified the expression for the objective function by dropping the constant term a, since it plays no role in the optimization step.
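A minimal numerical illustration of this standard form (not part of the original answer), using SciPy's general-purpose optimizer on a tiny made-up problem:

import numpy as np
from scipy.optimize import minimize

# A tiny QP in the standard form above:
#   minimize  c^T x + (1/2) x^T Q x   subject to  A x <= b,  x >= 0
Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # symmetric positive definite
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])
b = np.array([3.0])

f = lambda x: c @ x + 0.5 * x @ Q @ x
cons = [{"type": "ineq", "fun": lambda x: b - A @ x}]   # b - Ax >= 0  <=>  Ax <= b
res = minimize(f, x0=np.zeros(2), bounds=[(0, None), (0, None)], constraints=cons)
print(res.x)   # roughly [0.75, 2.25] for this small example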
Q10. Explain how Support Vector Machine can be used to find the optimal hyperplane to classify linearly separable data. Give suitable example.
Ans:
[10M | Dec18]
OPTIMAL HYPERPLANE:
1. The optimal hyperplane is completely defined by the support vectors.
2. The optimal hyperplane is the one which maximizes the margin of the training data.

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

HYPERPLANE:
1. A hyperplane is a generalization of a plane.
2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups.
3. As a simple example, for a classification task with only two features, you can think of a hyperplane as a line that linearly separates and classifies a set of data (like the figure above).
4. When new testing data is added, whichever side of the hyperplane it lands on will decide the class that we assign to it.
5. Hyperplane: (w · x) + b = 0, where w ∈ R^N, b ∈ R
6. Corresponding decision function: f(x) = sgn((w · x) + b)
7. Optimal hyperplane (maximal margin):
   max over (w, b) of min over i = 1, ..., m of { ||x - x_i|| : x ∈ R^N, (w · x) + b = 0 }
8. With y_i ∈ {-1, +1} it holds that y_i · ((w · x_i) + b) > 0 for all i = 1, ..., m.
9. w and b are not unique; w and b can be scaled so that |(w · x_i) + b| = 1 for the x_i closest to the hyperplane.
10. In the canonical form (w, b) of the hyperplane it now holds that y_i · ((w · x_i) + b) ≥ 1 for all i = 1, ..., m.
11. The margin of the optimal hyperplane in canonical form equals 2/||w||.
Q11. Find the optimal hyperplane for the data points:
{(2, 1), (2, -1), (4, 0), (5, 1), (5, -1), (6, 0)}
[10M]
Ans:
Plot the given data points on a graph (in the exam, you can draw it roughly on paper).
We assume that the hyperplane will appear in the region between the points on the left, {(2, 1), (2, -1)}, and the points on the right, {(4, 0), (5, 1), (5, -1), (6, 0)}.

We will use the three support vectors which are closest to the assumed region of the hyperplane:
s1 = (2, 1), s2 = (2, -1), s3 = (4, 0)

Based on these support vectors, we will find the augmented support vectors by adding 1 as a bias entry.
The augmented vectors will be:
s1~ = (2, 1, 1), s2~ = (2, -1, 1), s3~ = (4, 0, 1)

We will find the three parameters a1, a2, a3 based on the following three linear equations (the first two support vectors belong to the negative class and the third to the positive class):
a1 (s1~ · s1~) + a2 (s2~ · s1~) + a3 (s3~ · s1~) = -1
a1 (s1~ · s2~) + a2 (s2~ · s2~) + a3 (s3~ · s2~) = -1
a1 (s1~ · s3~) + a2 (s2~ · s3~) + a3 (s3~ · s3~) = +1

Substituting the values of s1~, s2~, s3~ in the above equations and simplifying, we get:
6 a1 + 4 a2 + 9 a3 = -1
4 a1 + 6 a2 + 9 a3 = -1
9 a1 + 9 a2 + 17 a3 = 1

By solving the above equations we get a1 = a2 = -3.25 and a3 = 3.5.

To find the hyperplane, which discriminates the positive class from the negative class, we use the following equation:
w~ = Σ a_i s_i~ = a1 s1~ + a2 s2~ + a3 s3~

Putting the values of a_i and s_i~ in the above equation, we get:
w~ = -3.25 (2, 1, 1) - 3.25 (2, -1, 1) + 3.5 (4, 0, 1) = (1, 0, -3)

Now we remove the bias entry which we added to the support vectors to get the augmented vectors.
So we use the hyperplane equation, which is y = w · x + b.
Here w = (1, 0), as we have removed the bias entry from w~, which is -3.
And b = -3, which is the bias. We can write this as x - 3 = 0 also.
So we will use w and b to plot the hyperplane on the graph.
As w = (1, 0), it is a vertical line. (If it were (0, 1), the line would be a horizontal line.)
And since x - 3 = 0, the hyperplane is the vertical line passing through x = 3, as shown below.

[Graph: the separating hyperplane is the vertical line x = 3, with (2, 1) and (2, -1) on its left and (4, 0), (5, 1), (5, -1), (6, 0) on its right.]
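The result can be sanity-checked with a library SVM, assuming the class labels implied by the solution (the two left points in one class, the rest in the other):

from sklearn.svm import SVC
import numpy as np

# Labels assumed from the worked solution above.
X = np.array([[2, 1], [2, -1], [4, 0], [5, 1], [5, -1], [6, 0]])
y = np.array([-1, -1, 1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w / abs(w[0]), b / abs(w[0]))           # roughly [1, 0] and -3, i.e. the line x = 3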
CHAP - 5: LEARNING WITH CLASSIFICATION
Q1. Explain with suitable example the advantages of Bayesian approach over classical approaches to Probability.
Q2. Explain classification using Bayesian Belief Network with an example.
Q3. Explain, in brief, Bayesian Belief network.
Ans:
[10M | Dec16, Dec17 & Dec18]
BAYESIAN BELIEF NETWORK:
1. A Bayesian network is a graphical model of a situation.
2. It represents a set of variables and the dependencies between them by using probability.
3. The nodes in a Bayesian network represent the variables.
4. The directional arcs represent the dependencies between the variables.
5. The direction of the arrows shows the direction of the dependency.
6. Each variable is associated with a conditional probability table (CPT).
7. The CPT gives the probability of this variable for different values of the variables on which this node depends.
8. Using this model, it is possible to perform inference and learning.
9. BBN provides a graphical model of causal relationships on which learning can be performed.
10. We can use a trained Bayesian network for classification.
11. There are two components that define a BBN:
    a. Directed Acyclic Graph (DAG)
    b. A set of Conditional Probability Tables (CPT)
12. As an example, consider the following scenario.
    Pablo travels by air if he is on an official visit. If he is on a personal visit, he travels by air if he has money. If he does not travel by plane, he travels by train but sometimes also takes a bus.
13. The variables involved are:
    a. Pablo travels by air (A)
    b. Goes on official visit (F)
    c. Pablo has money (M)
    d. Pablo travels by train (T)
    e. Pablo travels by bus (B)
14. This situation is converted into a belief network as shown in figure 5.1 below.
15. In the graph, we can see the dependencies with respect to the variables.
16. The probability values at a variable are dependent on the value of its parents.
17. In this case, the variable A is dependent on F and M.
18. The variable T is dependent on A, and variable B is dependent on A.
19. The variables F and M are independent variables which do not have any parent node.
20. So their probabilities are not dependent on any other variable.
21. Node A has the biggest conditional probability table, as A depends on F and M; T and B depend on A.
[Figure 5.1: Belief network. "Has money in pocket" (M) and "goes on official trip" (F) point to "travels by air" (A); A points to "travels by train" (T) and "travels by bus" (B); each node carries its probability table.]

22. First we take the independent nodes.
23. Node F has a probability of P(F) = 0.7.
24. Node M has a probability of P(M) = 0.3.
25. Next we come to node A.
26. The conditional probability table for this node can be represented as:
    F | M | P(A | F and M)
    T | T | 0.98
    T | F | 0.98
    F | T | 0.90
    F | F | 0.10
27. The conditional probability tables for T and for B (each conditioned on A) are read off the figure in the same way; in particular P(T | ¬A) = 0.6, P(B | A = T) = 0.00 and P(B | A = F) = 0.40.
28. Using the Bayesian belief network, we can get the probability of a combination of these variables.
29. For example, we can get the probability that Pablo travels by train, does not travel by air, goes on an official trip and has money.
30. In other words, we are finding P(T, ¬A, F, M).
31. The probability of each variable given its parent is found and multiplied together to give the probability:
    P(T, ¬A, F, M) = P(T | ¬A) × P(¬A | F and M) × P(F) × P(M)
                   = 0.6 × 0.98 × 0.7 × 0.3 = 0.123
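The joint probability in point 31 is just a product of the CPT entries along the network. A minimal check of that arithmetic (it simply repeats the four factors used above, not a general inference routine):

# Factors taken directly from the calculation above.
p_T_given_notA   = 0.6
p_notA_given_F_M = 0.98
p_F = 0.7
p_M = 0.3

joint = p_T_given_notA * p_notA_given_F_M * p_F * p_M
print(round(joint, 3))   # 0.123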
Q5. Explain classification using Back Propagation algorithm with a suitable example.
Q6. Explain how Back Propagation algorithm helps in classification.
Q7. Write short note on - Back Propagation algorithm.
Ans:
[10M | May16, Dec16, May17 & May18]
BACK PROPAGATION:
1. Back propagation is a supervised learning algorithm for training Multi-layer Perceptrons (Artificial Neural Networks).
2. The back propagation algorithm looks for the minimum value of the error in weight space.
3. It uses techniques like the Delta rule or Gradient descent.
4. The weights that minimize the error function are then considered to be a solution to the learning problem.

BACK PROPAGATION ALGORITHM:
1. Iteratively process a set of training tuples and compare the network's prediction with the actual target value.
2. For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value.
3. Modifications are made in the "backwards" direction from the output layer, through each hidden layer down to the first layer (hence called "back propagation").
4. The weights will eventually converge and the learning process stops.
BACK PROPAGATION ALGORITHM STEPS:
1. Initialize the weights:
   a. The weights in the network are initialized to small random numbers.
   b. Ex. ranging from -1.0 to 1.0, or -0.5 to 0.5.
   c. Each unit has a bias associated with it.
   d. The biases are initialized to small random numbers.
2. Propagate the inputs forward:
   a. The training tuple is fed to the network's input layer.
   b. The input is passed through the input units, unchanged.
   c. For an input unit j, its output O_j is equal to its input value I_j.
   d. Given a unit j in a hidden or output layer, the net input I_j to unit j is
      I_j = Σ_i w_ij O_i + θ_j
      Here,
      w_ij = weight of the connection from unit i in the previous layer to unit j
      O_i = output of unit i from the previous layer
      θ_j = bias of unit j
   e. Given the net input I_j to unit j, the output O_j of unit j is computed as
      O_j = 1 / (1 + e^(-I_j))
3. Back propagate the error:
   a. The error is propagated backward by updating the weights and biases.
   b. It reflects the error of the network's prediction.
   c. For a unit j in the output layer, the error is Err_j = O_j (1 - O_j)(T_j - O_j)
   d. Where,
      O_j = actual output of unit j
      T_j = known target value of the given tuple
      O_j (1 - O_j) = derivative of the logistic function
   e. The error of a hidden layer unit j is
      Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
4. Terminating condition: training stops when
   a. all Δw_ij in the previous epoch are so small as to be below some specified threshold, or
   b. the percentage of tuples misclassified in the previous epoch is below some threshold, or
   c. a pre-specified number of epochs has expired.
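A minimal sketch of one such training step for a tiny 2-2-1 network, using the forward and backward formulas above (the weights, data and learning rate below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Tiny network: 2 inputs -> 2 hidden units -> 1 output; small random weights/biases.
W1, b1 = rng.uniform(-0.5, 0.5, (2, 2)), rng.uniform(-0.5, 0.5, 2)
W2, b2 = rng.uniform(-0.5, 0.5, (2, 1)), rng.uniform(-0.5, 0.5, 1)
lr = 0.5

def train_step(x, t):
    global W1, b1, W2, b2
    # Forward pass (step 2 above).
    h = sigmoid(x @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    # Backward pass (step 3): Err = O(1-O)(T-O) at the output,
    # Err = O(1-O) * sum_k Err_k w_jk at the hidden layer.
    err_o = o * (1 - o) * (t - o)
    err_h = h * (1 - h) * (err_o @ W2.T)
    # Weight and bias updates.
    W2 += lr * np.outer(h, err_o);  b2 += lr * err_o
    W1 += lr * np.outer(x, err_h);  b1 += lr * err_h
    return float(((t - o) ** 2).sum())

x, t = np.array([1.0, 0.0]), np.array([1.0])
print([round(train_step(x, t), 4) for _ in range(5)])  # squared error shrinks each step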
Q8. Hidden Markov Model.
Q9. What are the different Hidden Markov Models?
Q10. Explain Hidden Markov Models.
Ans:
[10M | May16, Dec16, May17 & Dec17]
HIDDEN MARKOV MODEL:
1. The Hidden Markov Model is a statistical model based on the Markov process with unobserved (hidden) states.
2. A Markov process is referred to as a memory-less process which satisfies the Markov property.
3. The Markov property states that the conditional probability distribution of future states of a process depends only on the present state, not on the sequence of events.
4. The Hidden Markov Model is one of the most popular graphical models.
5. Examples of Markov processes:
   a. Measurement of weather patterns.
   b. Daily stock market prices.
6. HMM generally works on a set of temporal data.
7. HMM is a variant of a finite state machine having the following things:
   a. A set of hidden states {W}.
   b. An output alphabet of visible states {V}.
   c. Transition probabilities (A).
   d. Output probability (B).
   e. Initial state probability (π).
8. The current state is not observable; instead, each state produces an output with a certain probability (B).
9. Usually the states {W} and outputs {V} are understood, so an HMM is said to be a triple (A, B, π).
EXPLANATION OF HMM:
1. An HMM consists of two types of states, the hidden states (W) and the visible states (V).
2. Transition probability is the probability of transitioning from one state to another in a single step.
3. Emission probability: the conditional distribution of the observed variables, P(Vn | Wn).
4. An HMM has to address the following issues:
   a. Evaluation problem:
      For a given model θ with W (hidden states), V (visible states), A (transition probabilities) and V^T (sequence of visible symbols emitted), what is the probability that the visible state sequence V^T will be emitted by the model θ?
      i.e. P(V^T | θ) = ?
   b. Decoding problem:
      W^T, the sequence of hidden states that generated the visible symbol sequence V^T, has to be calculated.
      i.e. W^T = ?
   c. Training problem:
      For a known set of hidden states W and visible states V, the training problem is to find the transition probabilities A_ij and the emission probabilities B_jk from the training set.
      i.e. A_ij = ? and B_jk = ?

EXAMPLE OF HMM (coin toss):
1. You are in a room with a wall.
2. A person behind the wall flips one of 2 coins and tells you the result.
3. The coin selection and the toss are hidden.
4. You cannot observe the events, only the output (heads, tails) from the events.
5. The problem is then to build a model to explain the observed sequence of heads and tails.
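The evaluation problem P(V^T | θ) is typically solved with the forward algorithm. A minimal sketch with made-up two-coin probabilities (not from the original text):

import numpy as np

# Two hidden states (which coin), two visible symbols (H, T); numbers are illustrative.
A  = np.array([[0.7, 0.3],    # transition probabilities between coins
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],    # emission probabilities: coin 0 is biased towards heads
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])     # initial state probabilities

def forward(obs):
    """P(V^T | model) via the forward algorithm; obs is a list of symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(forward([0, 0, 1]))     # probability of observing H, H, T under this model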
Q11. Using Bayesian classification and the given data, classify the tuple (Rupesh, M, 1.73 m).

Attribute | Value      | Count (Short) | Count (Medium) | Count (Tall) | Prob. (Short) | Prob. (Medium) | Prob. (Tall)
Gender    | M          | 1             | 2              | 3            | 1/4           | 2/7            | 3/4
Gender    | F          | 3             | 5              | 1            | 3/4           | 5/7            | 1/4
Height    | (0, 1.6)   | 2             | 0              | 0            | 2/4           | 0              | 0
Height    | (1.6, 1.7) | 2             | 0              | 0            | 2/4           | 0              | 0
Height    | (1.7, 1.8) | 0             | 3              | 0            | 0             | 3/7            | 0
Height    | (1.8, 1.9) | 0             | 3              | 0            | 0             | 3/7            | 0
Height    | (1.9, 2)   | 0             | 1              | 2            | 0             | 1/7            | 2/4
Height    | (2, ∞)     | 0             | 0              | 2            | 0             | 0              | 2/4

[10M | May16]
Ans:
We have been given the Conditional Probability Table (CPT).
Tuple to be classified = (Rupesh, M, 1.73 m)

We will use the count part of the table to find the probability of Short, Medium and Tall.
(There is no need to draw the table again; the counts are taken from the table above.)

Here, n = 15 (summing up all the Gender counts).
Probability (Short)  = (1 + 3) / 15 = 0.27
Probability (Medium) = (2 + 5) / 15 = 0.47
Probability (Tall)   = (3 + 1) / 15 = 0.27
Here, Short, Medium and Tall are the class attribute values.

Now we will find the probability of the tuple for every value of the class attribute, i.e. Short, Medium and Tall.
Probability(tuple | Short) = P[M | Short] × P[Height | Short]
Here, P[M | Short] = probability of M given Short, and P[Height | Short] = probability of the height range given Short.
Probability(tuple | Short) = P[M | Short] × P[(1.7 - 1.8) | Short]
(As the height in the given tuple is 1.73 m, which lies in the (1.7 - 1.8) range of Height.)
(Now we have to get those values from the Probability part of the table.)
Probability(tuple | Short)  = 1/4 × 0 = 0
Probability(tuple | Medium) = P[M | Medium] × P[Height | Medium] = 2/7 × 3/7 = 0.12
Probability(tuple | Tall)   = P[M | Tall] × P[Height | Tall] = 3/4 × 0 = 0

Now we have to find the likelihood of the tuple for all class attribute values.
Likelihood of Short  = P[tuple | Short] × P[Short] = 0 × 0.27 = 0
Likelihood of Medium = P[tuple | Medium] × P[Medium] = 0.12 × 0.47 = 0.056
Likelihood of Tall   = P[tuple | Tall] × P[Tall] = 0 × 0.27 = 0

Now we have to calculate the estimate (evidence) from the likelihood values as follows:
Estimate = Likelihood of Short + Likelihood of Medium + Likelihood of Tall = 0 + 0.056 + 0 = 0.056
Now we will find the actual (posterior) probability for the tuple using the Naive Bayes formula:
P(C | tuple) = [P(tuple | C) × P(C)] / P(tuple)

Calculating the actual probability for all class attribute values:

1. Short:
P(Short | tuple) = [P(tuple | Short) × P(Short)] / P(tuple) = (0 × 0.27) / 0.056 = 0

2. Medium:
P(Medium | tuple) = [P(tuple | Medium) × P(Medium)] / P(tuple) = (0.12 × 0.47) / 0.056 = 0.056 / 0.056 = 1

3. Tall:
P(Tall | tuple) = [P(tuple | Tall) × P(Tall)] / P(tuple) = (0 × 0.27) / 0.056 = 0

By comparing the actual probabilities of all three values, we can say that Rupesh's height is Medium.
(We choose the largest value from these actual probability values.)
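The same posterior computation, spelled out as a short script (the probability values are the CPT entries used above; the variable names are illustrative):

# Posterior for the tuple (M, 1.73 m) over the classes Short / Medium / Tall.
priors = {"Short": 4/15, "Medium": 7/15, "Tall": 4/15}
p_gender_M = {"Short": 1/4, "Medium": 2/7, "Tall": 3/4}
p_height_17_18 = {"Short": 0, "Medium": 3/7, "Tall": 0}     # range containing 1.73 m

likelihood = {c: p_gender_M[c] * p_height_17_18[c] * priors[c] for c in priors}
evidence = sum(likelihood.values())
posterior = {c: likelihood[c] / evidence for c in priors}

print(max(posterior, key=posterior.get), posterior)   # Medium is the predicted class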
Important Tip:
Sometimes in the exam they can ask to find the probability of a tuple using raw data like the following (instead of a ready-made CPT):

Gender | Height | Class
Female | 1.6 m  | Short
Male   | 2.0 m  | Tall
Female | 1.9 m  | Medium
Female | 1.85 m | Medium
Male   | 2.8 m  | Tall
Male   | 1.7 m  | Short
Male   | 1.8 m  | Medium
Female | 1.6 m  | Short
Female | 1.65 m | Short

Now we have to construct a table as we have in our solved sum.
There are three distinct values in the Class attribute: Short, Medium and Tall.
There are two distinct values in the Gender attribute: Male and Female.
There are variable values in the Height attribute, so we will convert them into range values, from the lowest height to the highest height: (0 - 1.6 m), (1.6 m - 1.7 m), and so on.
We set up the table skeleton with rows for Gender (Male, Female) and Height ((0 - 1.6 m), (1.6 - 1.7 m), (1.7 - 1.8 m), (1.8 - 1.9 m), (1.9 - 2.0 m), (2.0 m - ∞)), and columns for the counts and probabilities of Short, Medium and Tall, which are filled in below.
By seeing the given data, we have to find how many tuples contain the (Male and Short) combination. The same applies for the (Male and Medium) and (Male and Tall) combinations.
The above step is also applicable for tuples containing the Female value.
Now we go through each tuple, check its height value, find which range it comes in, and increase the count for it in that range.

Now we will find the values for the probability section.

For the probability of the Gender values, we count all values for Short, Medium and Tall as follows.

There is a count of 4 for Short in the Gender part:
Short | Male: 1 | Female: 3 | Total: 4
Probability of Male = count of Male in the Short part / count of all Short values = 1/4
Probability of Female = count of Female in the Short part / count of all Short values = 3/4

There is a count of 3 for Medium in the Gender part:
Medium | Male: 1 | Female: 2 | Total: 3
Probability of Male = count of Male in the Medium part / count of all Medium values = 1/3
Probability of Female = count of Female in the Medium part / count of all Medium values = 2/3

There is a count of 2 for Tall in the Gender part:
Tall | Male: 2 | Female: 0 | Total: 2
Probability of Male = count of Male in the Tall part / count of all Tall values = 2/2
Probability of Female = count of Female in the Tall part / count of all Tall values = 0/2 = 0

The same process can be applied for the Height part.
Now we can fill the table of Conditional Probability as follows:

Attributes | Values        | Count                  | Probability
           |               | Short | Medium | Tall  | Short | Medium | Tall
Gender     | Male          | 1     | 1      | 2     | 1/4   | 1/3    | 2/2
           | Female        | 3     | 2      | 0     | 3/4   | 2/3    | 0
Height     | (0 - 1.6 m)   | 2     | 0      | 0     | 2/4   | 0      | 0
           | (1.6 - 1.7 m) | 2     | 0      | 0     | 2/4   | 0      | 0
           | (1.7 - 1.8 m) | 0     | 1      | 0     | 0     | 1/3    | 0
           | (1.8 - 1.9 m) | 0     | 2      | 0     | 0     | 2/3    | 0
           | (1.9 - 2.0 m) | 0     | 0      | 1     | 0     | 0      | 1/2
           | (2.0 m - ∞)   | 0     | 0      | 1     | 0     | 0      | 1/2
CHAP - 6: DIMENSIONALITY REDUCTION

Q1. Explain in detail Principal Component Analysis for Dimension Reduction.
Ans:
[10M - May16 & Dec18]

PRINCIPAL COMPONENT ANALYSIS (PCA) FOR DIMENSION REDUCTION:
1. The main idea of PCA is to reduce the dimensionality of a given data set.
2. The given data set consists of many variables, correlated to each other either heavily or lightly.
t am aXxIML Im exte
;
tasetupto
Reducing is don while retaining
ve
2.
3.
the vari
ation present In da
iables
which are !an
e
*
The same is done by transforming variables
to a new set of varia
own as Pring,
Compon
“.
ents
n
(PC)
:
iginal
.
ioti
PC are orthogonal, ordered such that retentio
n of variation prese nt in origginal compo
Ponents decre.,
6.
So,
aS we
move
down
in this way,
in order.
the
first principal
component
will have
maximum
-
oe
variation
that
was Dreser,
orthogonal component
.
7.
PrincipalCom ponents are Eigen
vectors of covariance matrix and
hence they are called as Orthog,.
8.
The data set on which PCA is to be applie
d must be scaled
9.
The result of PCA is sensitive to
relative scaling.
10.
Properties of Pc:
a.
PC are linear combination
b.
PC are orthogonal.
c.
Variation present in PC's decreases as we move
from first PC to last PC.
of Original variables.
IMPLEMENTATION:

I)   Normalize the data:
1.   Data given as input to the PCA process must be normalised for PCA to work properly.
2.   This can be done by subtracting the respective means from the numbers in the respective columns.
3.   If we have two dimensions X and Y, then every X becomes (x - x̄) and every Y becomes (y - ȳ).
4.   The result gives us a dataset whose mean is zero.

II)  Calculate the covariance matrix:
1.   Since we have taken a 2-dimensional dataset, the covariance matrix will be 2 x 2.
2.   Matrix (covariance) =
          | Var(X1)       Cov(X1, X2) |
          | Cov(X2, X1)   Var(X2)     |

III) Finding Eigen values and Eigen vectors:
1.   In this step we have to find the Eigen values and Eigen vectors of the covariance matrix.
2.   It is possible because it is a square matrix.
3.   λ will be an Eigen value of matrix A if it satisfies the condition det(λI - A) = 0.
4.   Then we can find the Eigen vector v for each Eigen value λ by solving (λI - A)v = 0.

IV)  Choosing components and forming the feature vector:
1.   We order the Eigen values from highest to lowest, so we get the components in order of significance.
2.   If the data set has n variables, then we will have n Eigen values and Eigen vectors.
3.   It turns out that the Eigen vector with the highest Eigen value is the principal component of the dataset.
4.   Now we have to decide how many Eigen values to keep for further processing.
5.   We choose the first p Eigen values and discard the others.
6.   We do lose some information in this process, but if the discarded Eigen values are small, we do not lose much.
7.   Now we form a feature vector. Since we are working on 2-D data, we can choose either the greater Eigen value or simply take both:
     Feature vector = (Eig1, Eig2)

V)   Forming Principal Components:
1.   In this step, we derive the Principal Components from the data of the previous steps.
2.   We take the transpose of the feature vector and the transpose of the scaled dataset and multiply them to get the Principal Components:
     New Data = Feature Vector^T x Scaled Dataset^T
     New Data = Matrix of Principal Components
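A minimal sketch of these five steps in Python with numpy (a generic illustration with made-up sample values, not code from the original text):

    import numpy as np

    def pca(data, p):
        """Return the first p principal components of `data` (rows = samples)."""
        # I) Normalize: subtract the column means so the dataset has zero mean.
        centred = data - data.mean(axis=0)
        # II) Covariance matrix of the centred data (variables as columns).
        cov = np.cov(centred, rowvar=False)
        # III) Eigen values and Eigen vectors of the covariance matrix.
        eig_vals, eig_vecs = np.linalg.eigh(cov)
        # IV) Order from highest to lowest Eigen value and keep the first p vectors.
        order = np.argsort(eig_vals)[::-1]
        feature_vector = eig_vecs[:, order[:p]]
        # V) Project the centred data onto the chosen components.
        return centred @ feature_vector

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
    print(pca(X, 1))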
Q2.  Describe the two methods for reducing dimensionality.

Ans:                                                          [5M | May17]

DIMENSIONALITY REDUCTION:
1.   Dimensionality reduction refers to the process of converting a set of data having vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.
2.   These techniques are typically used while solving machine learning problems to obtain better features for classification.
3.   Dimensions can be reduced by:
     a.  Combining features using a linear or non-linear transformation.
     b.  Selecting a subset of features.

METHODS FOR REDUCING DIMENSIONALITY:

I)   Feature Selection:
1.   It deals with finding the k most informative dimensions out of d dimensions and discarding the remaining (d - k) dimensions.
2.   It tries to find a subset of the original variables.
3.   There are three strategies of feature selection:
     a.  Filter.
     b.  Wrapper.
     c.  Embedded.

II)  Feature Extraction:
1.   It deals with finding k new dimensions from combinations of the d original dimensions.
2.   It transforms the data in a high-dimensional space to data in a lower-dimensional space.
3.   The transformation may be linear, like PCA, or non-linear, like Laplacian Eigen maps.
III) Missing values:
1.   We are given a training sample set of features.
2.   Among the available features, a particular feature may have many missing values.
3.   A feature with more missing values will contribute less to the classification process.
4.   Such features can be eliminated.

IV)  Low variance:
1.   Consider that a particular feature has constant values for all training samples.
2.   That means the variance of the feature across different samples is comparatively low.
3.   This implies that features with constant values or low variance have less impact on classification.
4.   Such features can be eliminated.
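A minimal sketch of the missing-value and low-variance filters described above (Python with numpy; the thresholds and sample values are illustrative assumptions):

    import numpy as np

    def filter_features(X, max_missing_frac=0.3, min_variance=1e-3):
        """Keep columns of X that have few missing values and non-trivial variance."""
        X = np.asarray(X, dtype=float)
        missing_frac = np.isnan(X).mean(axis=0)       # fraction of missing values per feature
        variance = np.nanvar(X, axis=0)               # variance ignoring missing values
        keep = (missing_frac <= max_missing_frac) & (variance >= min_variance)
        return X[:, keep], keep

    X = np.array([[1.0, 5.0, np.nan],
                  [1.0, 6.0, np.nan],
                  [1.0, 7.0, 2.0]])
    X_reduced, kept = filter_features(X)
    print(kept)   # first column dropped (zero variance), third dropped (too many missing values)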
Q3.  What is Independent Component Analysis?

Ans:                                                          [5M | May18 & Dec18]

INDEPENDENT COMPONENT ANALYSIS:
1.   Independent Component Analysis (ICA) is a statistical and computational technique.
2.   It is used for revealing hidden factors that underlie sets of random variables, measurements, or signals.
3.   ICA defines a generative model for the observed multivariate data.
4.   The given data is typically a large database of samples.
5.   In the model, the data variables are assumed to be linear mixtures of some unknown latent variables.
6.   The mixing system is also unknown.
7.   The latent variables are assumed to be non-Gaussian and mutually independent.
8.   These latent variables are called the independent components of the observed data.
9.   These independent components, also called sources or factors, can be found by ICA.
10.  ICA is superficially related to principal component analysis and factor analysis.
11.  ICA is a much more powerful technique, capable of finding the underlying factors or sources when these classic methods fail completely.
12.  The data analysed by ICA could originate from many different kinds of application fields.
13.  It includes digital images, document databases, economic indicators and psychometric measurements.
14.  In many cases, the measurements are given as a set of parallel signals or time series.
15.  The term blind source separation is used to characterize this problem.
EXAMPLES:
1.
Mixtures of simultaneous speech signals that have been picked up by several
microphones
2.
Brain waves recorded by multiple sensors
3.
Interfering radio signals arriving at a mobile phone
4.
Parallel time series obtained from some industrial process,
AMBIGUITIES:
1.
Can't determine the variances (energies) of the IC's
2.
Can't determine the order of the IC's
APPLICATION DOMAINS OF ICA:
1.   Image de-noising.
2.   Medical signal processing.
3.   Feature extraction, face recognition.
4.   Compression, redundancy reduction.
5.   Scientific Data Mining.
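A minimal sketch of running ICA on artificially mixed signals (Python; uses scikit-learn's FastICA as a generic illustration, with made-up sources and mixing matrix):

    import numpy as np
    from sklearn.decomposition import FastICA

    t = np.linspace(0, 8, 2000)

    # Two independent, non-Gaussian sources (e.g. two "speakers").
    s1 = np.sign(np.sin(3 * t))          # square wave
    s2 = np.sin(5 * t)                   # sinusoid
    S = np.c_[s1, s2]

    # Unknown mixing system: each "microphone" records a mixture of the sources.
    A = np.array([[1.0, 0.5], [0.7, 1.0]])
    X = S @ A.T

    # ICA recovers the independent components (up to order and scaling).
    ica = FastICA(n_components=2, random_state=0)
    S_estimated = ica.fit_transform(X)
    print(S_estimated.shape)             # (2000, 2)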
Q4.  Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given matrix A.

     A^T = | 2    1    0   -1  |
           | 4    3    1   0.5 |

Ans:                                                          [10M | May17 & May18]

Formula for finding covariance values:

     Covariance(x, y) = Σ (x - x̄)(y - ȳ) / (n - 1)          [valid for both (x, y) and (y, x)]

Here we have an orthogonal transformation to perform. We will consider the upper row as x and the lower row as y.
Here n = 4 (the total count of data points; take the count from either the x row or the y row).
Now we will find x̄ and ȳ as follows:
     x̄ = (sum of all values in x) / (count of all values in x) = (2 + 1 + 0 + (-1)) / 4 = 2/4 = 0.5
     ȳ = (sum of all values in y) / (count of all values in y) = (4 + 3 + 1 + 0.5) / 4 = 8.5/4 = 2.125

Now we have to find the values of (x - x̄), (y - ȳ), [(x - x̄)(y - ȳ)], (x - x̄)² and (y - ȳ)²:

      x    y     (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)   (x - x̄)²   (y - ȳ)²
      2    4       1.5      1.875        2.8185          2.25      3.5156
      1    3       0.5      0.875        0.4375          0.25      0.7656
      0    1      -0.5     -1.125        0.5625          0.25      1.2656
     -1    0.5    -1.5     -1.625        2.4375          2.25      2.6404
Finding the following values: Σ[(x - x̄)(y - ȳ)], Σ(x - x̄)², Σ(y - ȳ)²

     Σ[(x - x̄)(y - ȳ)] = 2.8185 + 0.4375 + 0.5625 + 2.4375 = 6.256
     Σ(x - x̄)² = 2.25 + 0.25 + 0.25 + 2.25 = 5
     Σ(y - ȳ)² = 3.5156 + 0.7656 + 1.2656 + 2.6404 = 8.1872

Now we will find the covariance values:

     Covariance(x, x) = Σ(x - x̄)² / (n - 1) = 5/3 = 1.67
     Covariance(y, y) = Σ(y - ȳ)² / (n - 1) = 8.1872/3 = 2.73
     Covariance(x, y) = Covariance(y, x) = Σ(x - x̄)(y - ȳ) / (n - 1) = 6.256/3 = 2.09

Therefore, putting these values in matrix form:

     S = Covariance matrix = | 1.67   2.09 |
                             | 2.09   2.73 |
As the given data is two dimensional, there will be two Principal Components. To find these Principal Components we use the characteristic equation:

     (S - λI) a = 0                                   ............ (1)

Here, S = the covariance matrix, and I = identity matrix = | 1  0 |
                                                           | 0  1 |

Putting these values into the characteristic equation and taking the determinant of (S - λI):

     det | 1.67 - λ     2.09     | = 0
         | 2.09         2.73 - λ |

     (1.67 - λ)(2.73 - λ) - (2.09)² = 0
     λ² - 4.4λ + 0.191 = 0

     λ₁ = 4.3562
     λ₂ = 0.0439
Taking λ₁ and putting it in equation (1) to find the factors a₁₁ and a₁₂ of the first Principal Component:

     | 1.67 - 4.3562     2.09          |   | a₁₁ |
     | 2.09              2.73 - 4.3562 | x | a₁₂ | = 0

     -2.6862 × a₁₁ + 2.09 × a₁₂ = 0                   ............ (2)
      2.09 × a₁₁ - 1.6262 × a₁₂ = 0                   ............ (3)

Let us find the value of a₁₂ (you can also take a₁₁) by dividing equation (2) by 2.09; we get:

     a₁₂ = (2.6862 / 2.09) × a₁₁ = 1.2853 × a₁₁

We can now use the orthogonal transformation relation:

     a₁₁² + a₁₂² = 1                                  ............ (4)

Substituting the value of a₁₂ in equation (4), we get:

     a₁₁² + (1.2853 × a₁₁)² = 1
     a₁₁² + 1.5620 × a₁₁² = 1
     2.5620 × a₁₁² = 1
     a₁₁² = 0.40
     a₁₁ = 0.64
Putting a₁₁ in equation (4):

     (0.64)² + a₁₂² = 1
     0.4096 + a₁₂² = 1
     a₁₂² = 1 - 0.4096 = 0.5904
     a₁₂ = 0.7684
Now taking λ₂ = 0.0439 and putting it in equation (1) to find the factors a₂₁ and a₂₂ of the second Principal Component:

     | 1.67 - 0.0439     2.09          |   | a₂₁ |
     | 2.09              2.73 - 0.0439 | x | a₂₂ | = 0

     1.6261 × a₂₁ + 2.09 × a₂₂ = 0                    ............ (5)
     2.09 × a₂₁ + 2.6861 × a₂₂ = 0                    ............ (6)

     a₂₂ = 0.7781 × a₂₁

Substituting the value of a₂₂ in equation (4), we get:

     a₂₁² + (0.7781 × a₂₁)² = 1
     a₂₁² + 0.6055 × a₂₁² = 1
     1.6055 × a₂₁² = 1
     a₂₁² = 0.6229
     a₂₁ = 0.7893
Putting a₂₁ in equation (4):

     (0.7893)² + a₂₂² = 1
     0.6230 + a₂₂² = 1
     a₂₂² = 1 - 0.6230 = 0.377
     a₂₂ = 0.6141

Therefore, using the values of a₁₁, a₁₂, a₂₁ and a₂₂, the Principal Components are:

     e₁ = a₁₁x₁ + a₁₂x₂ = 0.64x₁ + 0.7684x₂
     e₂ = a₂₁x₁ + a₂₂x₂ = 0.7893x₁ + 0.6141x₂
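A quick numpy cross-check of this example is sketched below (the data matrix is the one given in the question; the eigenvector signs and small decimals may differ slightly from the hand-rounded factors above):

    import numpy as np

    # Data from the question: upper row = x, lower row = y.
    A = np.array([[2.0, 1.0, 0.0, -1.0],
                  [4.0, 3.0, 1.0,  0.5]])

    # Covariance matrix with the (n - 1) denominator, as in the worked solution.
    S = np.cov(A)                        # 2 x 2 covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(S)

    # eigh returns eigenvalues in ascending order; print the largest first.
    print(eig_vals[::-1])                # approx. [4.35, 0.05]
    print(eig_vecs[:, ::-1])             # columns are the principal component directions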
CHAP - 7: LEARNING WITH CLUSTERING

Q1.  Describe the essential steps of the K-means algorithm for clustering analysis.
Q2.  Explain the K-means clustering algorithm with a suitable example. Also, explain how K-means clustering differs from hierarchical clustering.

Ans:                                                          [5-10M | Dec16]
K-MEANS CLUSTERING:
1.   K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem.
2.   The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.
3.   The main idea is to define k centres, one for each cluster.
4.   These centres should be placed in a cunning way, because different locations cause different results.
5.   So, the better choice is to place them as far away from each other as possible.
6.   The next step is to take each point belonging to the given data set and associate it with the nearest centre.
7.   When no point is pending, the first step is completed and an early grouping is done.
8.   At this point we need to re-calculate k new centroids as barycentres of the clusters resulting from the previous step.
9.   After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centre.
10.  A loop has been generated. As a result of this loop we may notice that the k centres change their location step by step until no more changes are done, or in other words the centres do not move any more.
11.  Finally, this algorithm aims at minimizing an objective function known as the squared error function, given by:

         J(V) = Σᵢ₌₁ᶜ Σⱼ₌₁ᶜⁱ ( ||xᵢ - vⱼ|| )²

     Where,
         ||xᵢ - vⱼ|| is the Euclidean distance between xᵢ and vⱼ,
         cᵢ is the number of data points in the iᵗʰ cluster,
         c is the number of cluster centres.
ALGORITHM STEPS:
1.   Let X = {x₁, x₂, ..., xₙ} be the set of data points and V = {v₁, v₂, ..., v_c} be the set of centres.
2.   Randomly select 'c' cluster centres.
3.   Calculate the distance between each data point and each cluster centre.
4.   Assign each data point to the cluster centre whose distance from that data point is the minimum over all the cluster centres.
5.   Recalculate the new cluster centres using:

         vᵢ = (1 / cᵢ) Σⱼ₌₁ᶜⁱ xⱼ

     Where cᵢ represents the number of data points in the iᵗʰ cluster.
6.   Recalculate the distance between each data point and the newly obtained cluster centres.
7.   If no data point was reassigned then stop, otherwise repeat from step 3.
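A minimal sketch of these steps (Python with numpy; a generic illustration rather than code from the text, using the same 1-D data as the worked example later in this chapter):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        """Cluster the rows of X into k groups following the steps above."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random initial centres
        for _ in range(n_iter):
            # steps 3-4: assign each point to its nearest centre (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 5: recalculate each centre as the mean of its assigned points
            new_centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
                                    for i in range(k)])
            if np.allclose(new_centres, centres):                # steps 6-7: stop when centres stop moving
                break
            centres = new_centres
        return labels, centres

    X = np.array([[2.0], [4.0], [6.0], [3.0], [31.0], [12.0], [15.0], [16.0],
                  [38.0], [35.0], [14.0], [21.0], [23.0], [25.0], [30.0]])
    labels, centres = k_means(X, k=3)
    print(labels, centres.ravel())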
ADVANTAGES:
1.   Fast, robust and easy to understand.
2.   Relatively efficient: O(tknd), where n is the number of objects, k is the number of clusters, d is the dimension of each object, and t is the number of iterations.
3.   Gives the best result when the data sets are distinct or well separated from each other.

DISADVANTAGES:
1.   Applicable only when the mean is defined, i.e. it fails for categorical data.
2.   Unable to handle noisy data and outliers.
3.   The algorithm fails for non-linear data sets.
DIFFERENCE BETWEEN K-MEANS AND HIERARCHICAL CLUSTERING:
1.   Hierarchical clustering can't handle big data well, but K-means clustering can.
2.   This is because the time complexity of K-means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
3.   In K-means clustering we start with a random choice of clusters, so the results produced by running the algorithm multiple times might differ, while results are reproducible in hierarchical clustering.
4.   K-means is found to work well when the shape of the clusters is hyper-spherical (like a circle in 2D or a sphere in 3D).
5.   K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into, whereas in hierarchical clustering you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
Q3.  Hierarchical clustering algorithms.

Ans:                                                          [10M | Dec17]

HIERARCHICAL CLUSTERING ALGORITHMS:
1.   Hierarchical clustering is also known as hierarchical cluster analysis.
2.   It is an algorithm that groups similar objects into groups called clusters.
3.   The endpoint is a set of clusters.
4.   Each cluster is distinct from every other cluster.
5.   The objects within each cluster are broadly similar to each other.
6.   This algorithm starts with all the data points assigned to a cluster of their own.
7.   The two nearest clusters are merged into the same cluster.
8.   In the end, this algorithm terminates when there is only a single cluster left.
9.   Hierarchical clustering does not require us to prespecify the number of clusters.
10.  Most hierarchical algorithms are deterministic.
11.  Hierarchical clustering algorithms are of two types:
     a.  Divisive Hierarchical clustering algorithm.
     b.  Agglomerative Hierarchical clustering algorithm.
12.  Both these algorithms are exactly the reverse of each other.

DIVISIVE HIERARCHICAL CLUSTERING ALGORITHM OR DIANA (DIVISIVE ANALYSIS):
1.   Here we assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters.
2.   Finally, we proceed recursively on each cluster until there is one cluster for each observation.
3.   There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
4.   Advantages:
     a.  More efficient if we do not generate a complete hierarchy.
     b.  Runs much faster than HAC algorithms.
5.   Disadvantages:
     a.  The algorithm will not identify outliers.
     b.  Restricted to data which has the notion of a centre (centroid).
6.   Examples:
     a.  K-Means algorithm.
     b.  K-Medoids algorithm.
AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHM OR AGNES (AGGLOMERATIVE NESTING):
1.   In the agglomerative or bottom-up clustering method we assign each observation to its own cluster.
2.   Then, we compute the similarity (e.g., distance) between each of the clusters and join the two most similar clusters.
3.   Finally, we repeat steps 2 and 3 until there is only a single cluster left.
4.   Advantages:
     a.  No a priori information about the number of clusters is required.
     b.  Easy to implement and gives the best result in some cases.
5.   Disadvantages:
     a.  No objective function is directly minimized.
     b.  The algorithm can never undo what was done previously.
     c.  Sometimes it is difficult to identify the correct number of clusters from the dendrogram.
6.   Types of Agglomerative Hierarchical clustering (a small code sketch follows this list):
     a.  Single - nearest distance or single linkage.
     b.  Complete - farthest distance or complete linkage.
     c.  Average - average distance or average linkage.
     d.  Centroid distance.
     e.  Ward's method - the sum of squared Euclidean distances is minimized.
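A minimal sketch of agglomerative clustering with a chosen linkage criterion (Python; uses scipy's hierarchical clustering routines as a generic illustration with made-up 2-D points):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Small 2-D dataset (illustrative values only).
    X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

    # Build the merge tree with a chosen linkage: 'single', 'complete', 'average' or 'ward'.
    Z = linkage(X, method="single")

    # Cut the tree into 3 clusters (like reading the dendrogram at a chosen height).
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)

    # dendrogram(Z) would draw the tree if a plotting backend is available.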
Q4.  What is the role of the radial basis function in separating nonlinear patterns?
Q5.  Write a short note on - Radial Basis Functions.

Ans:                                                          [10M | May18 & Dec18]

RADIAL BASIS FUNCTION:
1.   A radial basis function (RBF) is a real-valued function φ whose value depends only on the distance from the origin, so that φ(x) = φ(||x||).
2.   Alternatively, the value may depend on the distance from some other point c, called a centre, so that φ(x, c) = φ(||x - c||).
3.   Any function that satisfies this property is a radial function.
4.   The norm is usually the Euclidean distance, although other distance functions are also possible.
5.   Sums of radial basis functions are typically used to approximate given functions.
6.   This approximation process can also be interpreted as a simple kind of neural network.
7.   RBFs are also used as a kernel in support vector classification.
8.   Commonly used types of radial basis functions:
     a.  Gaussian:                φ(r) = e^(-(εr)²)
     b.  Multiquadratic:          φ(r) = √(1 + (εr)²)
     c.  Inverse quadratic:       φ(r) = 1 / (1 + (εr)²)
     d.  Inverse multiquadratic:  φ(r) = 1 / √(1 + (εr)²)
9.   A radial basis function network is an artificial neural network that uses radial basis functions as activation functions.
10.  The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.
11.  Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control.
12.  A radial basis function network is composed of an input layer, a hidden layer, and an output layer. An RBF network is strictly limited to having exactly one hidden layer. We call this hidden layer the feature vector.
13.  Radial basis function networks increase the dimension of the feature vector.

     [Figure: RBF network structure - input layer, hidden (radial basis) layer, output layer.]
14.  The hidden units provide a set of functions that constitute an arbitrary basis for the input patterns.
15.  Hidden units are known as radial centres and are represented by the vectors c₁, c₂, ..., c_h.
16.  The transformation from the input space to the hidden unit space is nonlinear, whereas the transformation from the hidden unit space to the output space is linear.
17.  The dimension of each centre for a p-input network is p.
18.  The radial basis functions in the hidden layer produce a significant non-zero response only when the input falls within a small localized region of the input space.
19.  Each hidden unit has its own receptive field in the input space.
20.  An input vector xᵢ which lies in the receptive field for centre cⱼ would activate cⱼ, and by a proper choice of weights the target output is obtained.
21.  The output is given as:

         y = Σⱼ₌₁ʰ wⱼ φ(||x - cⱼ||)

     where wⱼ is the weight of the jᵗʰ centre and φ is some radial function.
22.  Learning in RBFN: training an RBFN requires optimal selection of the parameter vectors cⱼ and the weights wⱼ.
23.  Both layers are optimized using different techniques and on different time scales.
24.  The following techniques are used to update the weights and centres of an RBFN (a short sketch of the Gaussian hidden layer follows this list):
     a.  Pseudo-Inverse Technique (offline)
     b.  Gradient Descent Learning (online)
     c.  Hybrid Learning (online)
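A minimal sketch of how an RBF hidden layer expands an input into Gaussian responses around the centres, followed by a linear output layer (Python with numpy; the centres, width and weights are illustrative assumptions):

    import numpy as np

    def rbf_hidden_layer(X, centres, eps=1.0):
        """Gaussian radial basis responses phi(||x - c_j||) for every input and centre."""
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        return np.exp(-(eps * dists) ** 2)

    # Two centres in a 2-D input space (assumed values for illustration).
    centres = np.array([[0.0, 0.0], [1.0, 1.0]])
    X = np.array([[0.1, 0.0], [0.9, 1.1], [2.0, 2.0]])

    H = rbf_hidden_layer(X, centres)          # hidden-layer feature vectors
    w = np.array([0.5, -0.2])                 # output weights (illustrative)
    y = H @ w                                 # network output: linear combination of RBF responses
    print(H.round(3), y.round(3))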
Q6.  What are the requirements of clustering algorithms?

Ans:                                                          [5M | May18]

REQUIREMENTS OF CLUSTERING:
1.   Scalability: We need highly scalable clustering algorithms to deal with large databases and learning systems.
2.   Ability to deal with different kinds of attributes: Algorithms should be capable of being applied to any kind of data such as interval-based (numerical) data, categorical data, and binary data.
3.   Discovery of clusters with arbitrary shape: The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to only distance measures that tend to find spherical clusters of small size.
4.   High dimensionality: The clustering algorithm should not only be able to handle low-dimensional data but also the high-dimensional space.
5.   Ability to deal with noisy data: Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
6.   Interpretability: The clustering results should be interpretable, comprehensible, and usable.
Q7.  Apply the K-means algorithm on the given data for k = 3. Use C₁(2), C₂(16) and C₃(38) as initial cluster centres.
     Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30

Ans:                                                          [10M | May16 & Dec16]

Number of clusters k = 3.
Initial cluster centres: C₁ = 2, C₂ = 16, C₃ = 38.

We will check the distance between the data points and all cluster centres. We will use the Euclidean distance formula for finding the distance.
     Distance[x, a] = √((x - a)²)
     OR
     Distance[(x, y), (a, b)] = √((x - a)² + (y - b)²)

As the given data is not in pairs, we will use the first formula of Euclidean distance.

Finding the distance between data points and cluster centres. We will use the following notation:
     D₁ = distance from the centre of cluster C₁
     D₂ = distance from the centre of cluster C₂
     D₃ = distance from the centre of cluster C₃
; 2
=0
(2, 2) = {@&— a}? = {2-2}?
D,(2,16) = {@ =a) = (@— 16)? =14
D,(2, 38) =
f(x — a)? = J (2 — 38)? = 34
Here 0 is smallest distance sc Data point 2 helcngs to ()
2)? =2
|
D4, 2) = fOr — ay? = (4
|
= 12
=
— ay? = (416)?
D;(4, 16) = (Ge
D.(4, 38) = (@ — a)? = V4 — 38)? = 34
:
2
point 4 belongs to C;
Here 2 is smallest distance so Data
(6—2)*=4
D.(6, 2) = fe — a)? =
= /(6—
D,(6, 16) = V(x — aj?
Ds(6, 38) = fe — O° =
l»
16)2 = 10
(6 — 38)? = 32
s to C,
ce so Data point 6 belong
Here 4 is smallest distan
3:
     D₁(3, 2) = √((3 - 2)²) = 1
     D₂(3, 16) = √((3 - 16)²) = 13
     D₃(3, 38) = √((3 - 38)²) = 35
Here 1 is the smallest distance, so data point 3 belongs to C₁.
31:
     D₁(31, 2) = √((31 - 2)²) = 29
     D₂(31, 16) = √((31 - 16)²) = 15
     D₃(31, 38) = √((31 - 38)²) = 7
Here 7 is the smallest distance, so data point 31 belongs to C₃.
12:
     D₁(12, 2) = √((12 - 2)²) = 10
     D₂(12, 16) = √((12 - 16)²) = 4
     D₃(12, 38) = √((12 - 38)²) = 26
Here 4 is the smallest distance, so data point 12 belongs to C₂.
15:
DiS, 2) = {@— a)? = f5— 2)? = 3
Da(15, 16) = {@— ay? = (G5 — 16)? = 1
Ds(15, 38) = {@
— ay? = (15
— 38)? = 23
Here
1is smallest distance so Data point 15 belongs to Cz
16:
Di(16, 2) = (@— a)? = (6-2)?= 14
D.(16, 16) = f(x
— a)? =
(16 - 16)? = 0
Ds(16, 38) = /(x
— a)? =
(16
— 38)? = 22
Here 0 is smallest distance so Data point 4 belongs to C2
38:
D(38, 2) = {@
— a)? = (8-2) = 36
1238, 16) = f(x
— a)? =
(8 — 16)? = 22
D-(38, 38) = {@
— a)? = (8-38)? = 0
Here O is smailest distance so Data point 38 belongs to C;
35:
1D,(35, 2) = f@
— a)? = Y(85
— 2)? = 33
D2(35, 16) = fe
— a)? = (35
— 16)? = 21
D3(35, 38) = /(@
— a)? = {5
— 38)? = 3
Here 3 is smallest distance so Data point 35 belongs to C;
14:
D,(14, 2) = {@—a)*= (04-2)? = 12
D2(14, 16) = fe
— a)? = (04-16)?= 2
D(14, 38) = f(@ — a)?= (04-38)?= 24
Here 2 is smallest distance so Data point 14 belongs toc,
21:
D,(21, 2) =
f— a)? = ¥(21=2)? = i9
D.(21, 16) = f(— a)? = {QT
- 12
=5
¥ Handcrafted by BackkBenchets Publications
ee
“a
pace 80°
d
Scanned by CamScanner
FS
; EP _7| Learning with Clustering
wwww.ToppersSolutions.com
ps(2, 38) = Yea)? = JT aa = 17
Here 5 is smallest distance so Data point 21 belongs to C;
D:(23,2) = J@— a)? = (3B
D,(23, 16) = V(e=a)y
De = ay
= V@3— 16) = 7
0:(23, 38) = (xa)? = /@3— 3a = 15
Here7 is smallest distance so Data point 23 belongs to C;
25:
D(25,2) = (=a)? = (@5—ae
= 23
D,(25, 16) = f(v—=a)? =
/@5—1He=9
= 3a
D125, 38) = f(r -a)? = J(@5
= 13
Here 91s smallest distance so Data paint 25 belonas to C;
30:
D(3C,
2) = fe —a)? = J@0—
27 = 28
= 16) = 14
D.(30, 16) = Y= a)? = YG
(,(30, 38) = /(r—a¥ = (G0 38) = 8
Here
& 1s smallest distance so Data point 30 belongs to C;
The clusters will be:
     C₁ = {2, 4, 6, 3}
     C₂ = {12, 15, 16, 14, 21, 23, 25}
     C₃ = {31, 38, 35, 30}

Now we have to recalculate the centres of these clusters as follows:
     C₁ = (2 + 4 + 6 + 3) / 4 = 15/4 = 3.75 (we can round this value to 4)
     C₂ = (12 + 15 + 16 + 14 + 21 + 23 + 25) / 7 = 126/7 = 18
     C₃ = (31 + 38 + 35 + 30) / 4 = 134/4 = 33.5 (we can round this value to 34)
Now we will again calculate distance from each data point to all new cluster centres,
Ps
- 4)? =2
D2.4) = f@—ay = V2
0.2, 18) = (@— a}? = ¥@ — 18)? = 16
Dii2, 84) = YO — a? = V2 — 34)? = 32
to C,
Here 2 is smallest distance so Data point 2 belongs
\
&
D4, 4) = J@-9 =/(4-4)7=0
D.(4,18) = (@—a = V4 18)? = 16
D434) = Va
VGH 34} =50
so Date point 4 belongs te
Here 0 is smallest distance
(
;
G:
DG, 4) = JY apt =
VO =e
= 2
D.{6, 18) = (= aye = SE 18)* 7 2
— 34 = 28
Ds(6, 34)= /@— ay = Ve
-
Data point 6 belongs to
Here 2is smallest distance so
3:
D3, 4)= (@—ay = {G-H*=1
D.(3, 18) = ((— a) = (@— 18)? =15
— a)? = /@-347 =31
Ds(3, 34) = {@
belongs to Ci
Here 1is smallest distance so Data point 3
3]
— 4? = 27
— ay? = (GI
D,(31, 4) = (@
— 18)? = 8
D.2(31, 18) = {@— ay? = (1
— a)? =
D5(31, 34) = {@
(G1 — 34)? =3
belongs to Cs
Here 3 is smallest distance so Data point 31
T2:
|
Dy(12, 4) = /@ — ap? = (02-4)? = 8
|
=6
D.2(12, 18) = /(@ — a)? = (12 — 18)?
— 34)? = 22
— a)? = (12
Ds(12, 34) = JG
Here 6 is smallest distance so Data point 12 belongs to C2
15:
D,(15, 4) = (@— a)? = (05-4)? =71
— 18)? =3
— a)? = (5
D.(15, 18) = /@
— 34)? =19
— a)? = V5
D:(15, 34) = (@
Herve & is smallest distance so Data point 15 heicrigs to C,
16:
— 4)? = 12
— a)? = {6
Di(16, 4) = /@
— a)? =
D.2(16, 18) = /@
/(16 — 18)? = 2
— 34)? = 18
D:(16, 34) = (@ — a)?= /(16
Here 2 is smallest distance so Data point 4 belongs to C2
38:
     D₁(38, 4) = √((38 - 4)²) = 34
     D₂(38, 18) = √((38 - 18)²) = 20
     D₃(38, 34) = √((38 - 34)²) = 4
Here 4 is the smallest distance, so data point 38 belongs to C₃.

35:
     D₁(35, 4) = √((35 - 4)²) = 31
     D₂(35, 18) = √((35 - 18)²) = 17
     D₃(35, 34) = √((35 - 34)²) = 1
Here 1 is the smallest distance, so data point 35 belongs to C₃.

14:
     D₁(14, 4) = √((14 - 4)²) = 10
     D₂(14, 18) = √((14 - 18)²) = 4
     D₃(14, 34) = √((14 - 34)²) = 20
Here 4 is the smallest distance, so data point 14 belongs to C₂.
Here 4 Is smallest distance so Data point 4 belongs to C,
2k:
D(21, 4) = JO =a)? = {Q1— a? = 17
0,(21, 18) = {@— a}? = (= 18)? = 3
Here 3 is smallest distance so Data point 21 belongs to
Cz
} 23:
D(23, 4) = ((—a) = (3-4? = 19
D,(23, 18) = f(x
— a)? = (23-18)? = 5
(22, 34) = f(x
= a)? = f(23—34)? = 11
i
Here 5 is smallest distance so Data point 23 belongs to C?
25:
(25, 4) = {@—a)* = {25-47= 21
1D,(25, 18) = J
— a)? = J (25 — 18)? =7
(25, 34) = @ -a)2 =
(25-34)? =9
tw
es
Here 7 is srnallest distance so Data point 25 belongs to C2
D,(30, 4) = .[@ =a)? =
(G0— 4)?= 26
1D,(30, 18) = f(x — a)? = J(30- 18)? = 12
1:(30, 34) = f — a)? = (G0 - 34)? = 4
s to Cz
Here 4 is smallest distance so Data point 30 belong
| The updated clusters will be,
C) = {2, 4, 6, 3),
ne a
| Cz = (12,15, 16, 14, 21, 23, 25),
Cy = (31, 38, 35, 30}
Raia
MeL
Sg MSN ERO
We can see that there is no difference between
previous clusters and these updated
clusters,
so we will
Stop the process here.
Finalised clusters ~
G= (2, 4, 6, 3},
|
C= (12, 15, 16, 14, 21, 23, 25),
C= (31, 58, 35, 30}
Be
“een
es
—
.
oe
Publications
Y Handerafted by BackkBenchers
Page 83 of 102
Scanned by CamScanner
1
'
P ’ f
i
li _
3 SoOlut;
www. Topporss,
a
_
g
ustering
i clus
Chap - 7] Learniin with
~
Q8.  Apply the K-means algorithm on the given data for k = 2. Use C₁(2, 4) and C₂(6, 3) as initial cluster centres.
     Data: a(2, 4), b(3, 3), c(5, 5), d(6, 3), e(4, 3), f(6, 6)

Ans:

Number of clusters k = 2.
Initial cluster centres: C₁ = (2, 4), C₂ = (6, 3).

We will check the distance between the data points and all cluster centres.
We will use the Euclidean distance formula for finding the distance:

     Distance[x, a] = √((x - a)²)
     OR
     Distance[(x, y), (a, b)] = √((x - a)² + (y - b)²)

As the given data is in pairs, we will use the second formula of Euclidean distance.

Finding the distance between data points and cluster centres. We will use the following notation:
     D₁ = distance from the centre of cluster C₁
     D₂ = distance from the centre of cluster C₂
(2, 4):
     D₁[(2, 4), (2, 4)] = √((2 - 2)² + (4 - 4)²) = 0
     D₂[(2, 4), (6, 3)] = √((2 - 6)² + (4 - 3)²) = √17 = 4.12
Here 0 is the smallest distance, so data point (2, 4) belongs to cluster C₁.

As the data point belongs to cluster C₁, we will recalculate the centre of cluster C₁, using the following formula for finding the new centre of a cluster:

     Centre[(x, y), (a, b)] = ((x + a)/2, (y + b)/2)

Here, (x, y) = current data point and (a, b) = old centre of the cluster.

     Updated centre of cluster C₁ = ((2 + 2)/2, (4 + 4)/2) = (2, 4)
(3, 3):
     D₁[(3, 3), (2, 4)] = √((3 - 2)² + (3 - 4)²) = 1.42
     D₂[(3, 3), (6, 3)] = √((3 - 6)² + (3 - 3)²) = 3
Here 1.42 is the smallest distance, so data point (3, 3) belongs to cluster C₁.

As the data point belongs to cluster C₁, we will recalculate the centre of cluster C₁:

     Updated centre of cluster C₁ = ((3 + 2)/2, (3 + 4)/2) = (2.5, 3.5)
(5, 5):
     D₁[(5, 5), (2.5, 3.5)] = √((5 - 2.5)² + (5 - 3.5)²) = 2.92
     D₂[(5, 5), (6, 3)] = √((5 - 6)² + (5 - 3)²) = 2.24
Here 2.24 is the smallest distance, so data point (5, 5) belongs to cluster C₂.

As the data point belongs to cluster C₂, we will recalculate the centre of cluster C₂:

     Updated centre of cluster C₂ = ((5 + 6)/2, (5 + 3)/2) = (5.5, 4)

(6, 3):
     D₁[(6, 3), (2.5, 3.5)] = √((6 - 2.5)² + (3 - 3.5)²) = 3.54
     D₂[(6, 3), (5.5, 4)] = √((6 - 5.5)² + (3 - 4)²) = 1.12
Here 1.12 is the smallest distance, so data point (6, 3) belongs to cluster C₂.

As the data point belongs to cluster C₂, we will recalculate the centre of cluster C₂:

     Updated centre of cluster C₂ = ((6 + 5.5)/2, (3 + 4)/2) = (5.75, 3.5)
(4, 3):
     D₁[(4, 3), (2.5, 3.5)] = √((4 - 2.5)² + (3 - 3.5)²) = 1.59
     D₂[(4, 3), (5.75, 3.5)] = √((4 - 5.75)² + (3 - 3.5)²) = 1.83
Here 1.59 is the smallest distance, so data point (4, 3) belongs to cluster C₁.

As the data point belongs to cluster C₁, we will recalculate the centre of cluster C₁:

     Updated centre of cluster C₁ = ((4 + 2.5)/2, (3 + 3.5)/2) = (3.25, 3.25)

(6, 6):
     D₁[(6, 6), (3.25, 3.25)] = √((6 - 3.25)² + (6 - 3.25)²) = 3.89
     D₂[(6, 6), (5.75, 3.5)] = √((6 - 5.75)² + (6 - 3.5)²) = 2.52
Here 2.52 is the smallest distance, so data point (6, 6) belongs to cluster C₂.

The final clusters will be:
     C₁ = {(2, 4), (3, 3), (4, 3)}
     C₂ = {(5, 5), (6, 3), (6, 6)}
Q9.  Apply the Agglomerative clustering algorithm on the given data and draw the dendrogram. Show three clusters with their allocated points. Use the single link method.

     Adjacency Matrix:

          a      b      c      d      e      f
     a    0
     b    √2     0
     c    √10    √8     0
     d    √17    1      √5     0
     e    √5     1      √5     2      0
     f    √20    √18    2      3      √13    0

Ans:                                                          [10M | May16 & Dec17]

We have to use the single link method to solve this sum. We use the following formula for the single link:

     Min[dist(a), (b)]   ............ we will choose the smallest distance value from the two distance values.
Be
:
Handcrafted ky BackkBenchers Publications
».¥
Page
85 of 102
.
Scanned by CamScanner
|
WWW.-TOPPErsSolutio,
Son
Chap ~ 7 | Learning with Clustering
OS TET
BO
OS
Le oe
ae
;
can see 2 tha
tt
h
e upper bound part of diagona| idens
* “Al t
We have givers Acjacency matrix in which we ©*
atrix
44
art of the Mate.
lower bound of diagonal se we can Use any par
show below.
Wo will use Lower bound ef diagonal as
;
e matrix. We can see that1 is smay.
Mallen
distance value from above distanc
it.
from
e
matrixi so We can choos e anyYY one valu
clistance value but it appears twice in
Now we have to find minimum
Taking distance value of (b, @) =1
Now we will draw dendrogram
Now
we have
for it
to recalculate distance matrix.
remaining
We will find distance between clustered paints Le. (b, e) and other
points.
e Distance between (b, e) and a:
|
Min[dist (b, e), a]
= Min{dist (b, a), (e, a)]
= Min{V2, V5]
|
sh. ccscasentinaw isin as we have to choose smallest value.
* Distance between (b, e) and c:
Min|dist (b, e), ¢] = Minfdist (b,c), (e, ¢)] = Min[v@, v5] = ¥5
« Distance between (b, e) and d:
Minf{dist (b, e), d] = Min{dist (b, d), (e, d)] = Min{1,2] = 7
¢ Distance between (b, e) and f:
Min{dist (b, e), f] = Min{dist (b, f), (2, t)] = Min[v18,13] = V73
Now we have put these values to update distance matrix:
—
|.
ibe)
c
d
f
[V2
vi0
Vii
V2
Now we have to find smallest cistance value from updated
_|
distance mat
rix again
Here distance between [(b, e), d] = lis small
est distance value j
ne
ae in distance matrix
a
¢ Handcrafted
by BackkBenchers Publications
Scanned by CamScanner
i|
‘
ae? jb earning 9 wit withClustering
.com
www.ToppersSolutions
wwe have to draw de
ndrogram for the Se
New clustered points
.
I
wow recalculating distance matrix again
finding distance between clustered points and other remaining points.
|, Distance between [[(b, e), d] and a:
|
Min (dist [(, e), d], a} = Min {dist [(b, e), a], [4, al} = Min (V2, V17] = v2
_, Distance
between
[(b, e), d] andc:
|
Min {dist {(b, e), a], c} = Min {dist {(b, e), ¢),[d, ¢]} = Min [V5, v5] = V3
|. Distance between [[b, e), d} and
#:
Min {dist [(b, ¢), d], f} = Min {dist [(b, e), f], [d, f]} = Min [v73,3] = 3
ow we have to put these distance values in distance matrix to update it.
a
Here distance between
(b, e,d)
}c
f
[(b, €, d), a] = V2 is smallest distance value in updated distance matrix.
Drawing dendrogram tor new clustered points
Now recalculating distance matrix.
c:
Finding distance between [(b, e, d), al and
c]} = Min [¥5, V10] = v5
Min (dist [(b, e, a), a}, [c]} = Min {dist [(b, e, d), ], [a,
_*
|
»
d), a] and f:
Finding distance between [(b, e,
fl, {a, f]} = Min [3, V20] =3
Min {dist [(b, e, 2), al, [fl] = Min {dist [(b, e, d),
distance matrix to update it.
Putting these new distance values in
(b,e,d,a) ] ¢ | F
(b, e,d, a) | O
c
f
V5
3
O
0
2 Ba
f 4 : Handcrafted by BackkBenchers Publications
:
Page 87 of 102
Scanned by CamScanner
x
Lunesta
Wwww.TOPDETSSolutio,
Cg
.
Chap - 7 | Learning with Clustering
.
.
Here, distance between (¢, f) = 2is smallest
™
distance.
[
4
Recalculating distance matrix,
«
Finding distance between (c, f) and (b.e.d.a)
d, a)]} = Min [v5, 3] = V5
Min { dist (c, f), (b, e, d, a)} = Min (dist [¢, (b, & d, a)], [F, (Be %
Putting this value to update distance matrix
Drawing dendrogram for new cluster,
|
[|
c
Ql0.
f
o
a
e
For the given set of points identify clusters using complete
link and
average link using
agglomerative clustering.
A
B
|
1
P2
115
15
P3
15
5
P4
13
4
P)
rps [aya
Cae
Ans:
[loM - May!@l
We have to solve this sum using complete link
method and Average link
Ink methoh d.
oe
.
We use following formula for complete link sum
.
7
Max [dist (a), (b)] ..........-....... We will choose biggest distance
.
value fr Om two distance
j
value S.
We use following formula for complete link sum
Avg [dist (a), (bJ]=* 5 [a+b]
é
~ We will choose biggest dista nce value from two distance
yalue?
ne
i
tr
x
i
¥ Handcrafted by BackkBenchers Publications
ee
a
ae
Ee
—
Page
—
ga of ™
Scanned by CamScanner
with Clustering
7! Learning
ee,
chap
: yore given data is notin distance20 /
adja
AC
VAce
CENCY
tNatriyi form,
&So we will convert it to distance / adjacency
gattix using Euclidean distance formul
a a WhichWhich j1S
a8 followinc
bye
fv
,
www.ToppersSolutions.com
a
.
a
at
'
gistance [% ¥) (a, BY] = fo
“
=
«
ayey. (y~b)?
vm
yop
|
yow finding distance between p} and P2, Dista
“7
.
\
Stance
J,
P2]
ine
| gistancelt 1), (1.5, 1.5) = {aye
| jasperabov
hovee ste
stepp we we wil ¢ OO* OW = (OTTaTae
will find distance between other
points as
= 0.7
well
Distance [P1, P3] = 5.66
Distance [P1, P4] = 4.6)
Distance [PI], P5) = 4.25
Distance [P1, Pé] = 3.21
Distance [P2, P3] = 4.95
Distance [P2, P4] = 2.92
Distance [P2, PS] = 3.54
Distance [P2, P6] = 2.5
Distance
[P3, P4] = 2.24
Distance [P3, P5] = 1.42
Distance [P3, P6] = 2.5
Distance [P4, P5] =1
Distance [P4, P6] =0.5
Distance [P5S, P6] = 1.12
Putting these values in lower bound of diagonal of distance / adjacency matrix,
PS
P4
P3
P2
Pl
p2
O71
a
O
PS
5.66 |
495
0
| P4
3.61
292
PS.
4.25
3.54
2.24
142
0
7
0
[PG
3.21
2.5
25
0.5
12
ipt”™*é‘“‘i‘C
|
I
5
.
COMPLETE LINK METHOD:

Here, the distance between P4 and P6 is 0.5, which is the smallest distance in the matrix, so P4 and P6 are merged first.

Recalculating the distance matrix:

•  Distance between (P4, P6) and P1
   = Max[dist(P4, P6), P1]
   = Max[dist(P4, P1), dist(P6, P1)]
   = Max[3.61, 3.21]
   = 3.61   ............ we take the biggest distance value as it is the complete link method.
SSI
Lo
”
‘
www-TOPPersSolutia,,,
oy
y
Chap -— 7 | Learning with Clustering
Distance between (P4, P6) and P2
= Max{dist (P4, P6), P2]
= Max{dist (P4, P2), (P6, P2)]
= Max(2.92, 2.5]
= 2.92
Distance between (P4, P6) and P3
= Max{dist /P4, P6), P3]
= Max{[dist (P4, P3), (P6, P3)]
= Max[2.24, 2.5]
=2.5
Distance between (P4, P6) and PS
= Max{dist (P4, P6), P5]
= Max[dist (P4, PS), (P6, PS)]
= Max], 1.12]
= 1.12
Up pdating distance matrix:
Pl
O
Here (P1, P2) = 0.71 is smallest distance in matrix
PI.
Pl
P2
Recalculating distance matrix:
e
Distance between
(P1, P2) and P3
= Max[dist (P1, P2), P3]
= Max{dist (P1, P3), (P2, P3)]
= Max[5.66, 4.95]
= 5.66
Distance between (PI, P2) and (P4, P6)
= Max{dist (Pi, P2), (P4, Pe)}
= Max{ dist (P1, (P4, P6)}, [P2, (P4,
P6)}]
= Max[3.61, 2.92}
= 3.61
\ Handeratted by BackkBenehere Publeatong
Handcrafted by BackkBenchers Publications
eo
page 90°
-
ad
Scanned by CamScanner
“nap 7! Learning with Clustering
ns.com
www.ToppersSolutio
pistance between (P1, P2) and ps
= Max([dist (P1, P2), PS]
- Max[dist (P1, PS), (P2, PS)
2 Max[4.25, 3.54]
2425
updating distance matrix using above values
(PI, P2
(Pi, P2)
p4,P6
(
| i
)
5
es
0
-—__]
5
5.66
| P3
a
(P4, P6)
P3
3.61
25
5
te
1.42
112
|
O
Here [(P4, P6), PS] = 1.12 is smallest distance in matrix,
Pll
Ed
PS
P1
P2
Recaiculating distance matrix
+
Distance between [(P4, P6), PS} and (P1, P2)
= Max[dist {(P4, P6), PS}, (P1, P2)]
= Max[ dist {(P4, P6), (P1, P2)}, (PS, (P1, P2)}]
= Max[3.61, 4.25]
=425
*
P3
Distance between {(P4, P©), PS} and
= Max{dist {(P4, P6), PS}, P3]
P3)]
= Max{ dist {(P4, PS), P3}, {P5,
= Max[2.5, 1.42]
=2.5
Updating distance matrix
(P1, P2)
(Pi, P2)
(P4, P6, PS)
0
5.66
(P4, P6, PS)
Here, [(P4, P6, P5), P3] =
RS
4.25
ance in matrix,
2.5 is smallest dist
Page 91 of 102
Dublications
_* Handcrafted by BackkBenchers
Scanned by Cam Scanner
Vivi vite tM RR
POU
tign,
Pn,
——
Chap - 7| Learning with Clustering
6
pA
pS
P3
pl
P2
Recalculating distance matrix:
*
Distance between {(P4, P6, PS), P3} and
{P1, P2}
= Max(dist {(P4, P6, PS), P3}, (PI, P2}]
= Max{dist{(P4, P6, PS), (PT, P2)}, (P3, (Pl, P2)}]
= Max[4.25, 5.66]
= 5.66
Updating distance matrix
| (P4, P6, P5, P3)
|(Pl, P2)
(Pl, P2)
0
(P4, P6, PS, P3)
5.66
a
P5 P3
P4
AVEPAGE
py
P2
LINK METHOD:
Original Distance Matrix:
Pl
PS
P4
P3
P2
| PS
0
PI
0
P2
0.71
P3
5.66
4.95
P4
3.61
2.92
7
2.24
0
PS
4.25
3.54
1.42
1
5
25
25
a
=
P6
3.21
-
0
|
|
rT
~
=a
Lo
Here, distance between P4 and P6 is 0.5 which is smallest distance in matrix
P4
6
a
ee
Handcrafted py BackkBenchers Publications
paye
of I?
i
Scanned by CamScanner
-Tp oes
ent ~'ustering
ee
we
_giculating distance matriy
pistance between
ot
Www.ToppersSolutions.com
oo
’
(P4, P6) and py
- avg[dist (P4, P6), P)]
- Avgldist (P4, Pl), (P6, Pl}]
2213.61 + 3.21)
2 BAT ccecctteeecneeeee. WO
take average distance
:
value as it is
Average link method
Distance between (P4, P68)
and P2
= Avg[dist (P4, P6), P2]
= Avg[dist (P4, P2), (P6,
P2)]
z : [2.92 + 2.5]
=271
«
Distance between
(P4, P6) and P3
= Avgldist (P4, P6), P3]
= Avg{dist (P4, F3), (P6, P3)]
= 5 (2.24 +25)
= 2.37
«
Distance between
(P4, P6) and PS
= Avg|[dist (P4, P6), PS]
= Avgfdist
(P4, PS), (P6, PS)]
= +f +112]
= 1.06
‘Jpdating distance matrix:
|
T Pi
P2
P3
IP
oO
| P2
0.71
0
ps
5.66
4.95
0
(P4, P6)
PS
3.41
4.25
27
3.54
237
1.42
(P4, P6)
PS
0
1.06
[°
}lere (Pl, P2) = 0.71 is smallest distance in matrix
Plt
P1
P2
Recalculating distance matrix:
*
Distance between
(P1, P2) and P3
= Ava{dist (Pl, P2), P3]
= Ava|dist (P1, P3), (P2, P3)]
= +[5.66 + 4.95]
=53)
~
ications
_¥ Handcrafted by BackkBenchers Publ
Page 93 of 102
Scanned by CamScanner
4
WWww.TopPersSolut; on
Si
Chap -7 | Learning with Clustering
*
Distance between (P1, P2) and (P4, P6)
= Avg[dist (P1, P2), (P4, P6)]
= Avg[ dist {P1, (P4, P6)}, {P2, (P4, PEI}]
= 5 (341+ 2.71]
= 3,06
¢
Distance between
(P1, P2) and PS
= Avg[dist (P1, P2), P5]
= Avgldist
(P1, PS), (P2, P5)]
=5[4.25 + 3.54]
= 3.90
Updating
distance matrix
using above values,
(Pi, P2)
(Pl, P2)
p3
oO
P3
0
(P4, P6)
2.5
PS
1.42
Here [(P4, P6), P5] = 1.12 is smallest distance in matrix,
rd
Recalculating
¢
[|
Pl
P2
distance matrix:
Distance between {(P4, P6), PS} and (PI, P2)
= Avgldist {(P4. P6), P5}, (P1, P2)j
= Avg[ dist {(P4, P6), (P1, P2)}, {P5, (P1, P2)} ]
> [3.06 + 3.90]
= 3.48
Distance between {(P4, P6), P5) and P3
= Avg[dist {(P4, P6), PS}, P3]
= Avg| dist {(P4, P6), P3}, {P5, P3}]
1
5 [2.5 + 1.42]
Mt
¢
1 96
Updating distance matrix
(P), P2)
P3
(P4, P6, PS)
(P, P2)
0
5.66
3.48
(P4, P6, PS)
om,
Lap _7| Learning with Clusterin
g
ere, [(P4, P6, PS), P3]
= 1.96 is SMallest dist
ance in
www.ToppersSolutions.com
Matrix,
PLT
gecalculating distance matrix:
Pl
P2
Distance between {(P4, P6, PS), P3} and {P1, p2)
Avgldist ((P4, P6, PS), P3}, {P1, P2y]
= Avgldist{((P4, PG, PS), (PI, P2)}, (P3, (PI, P2)}
=1 (3.48 + 5.66]
2457
Updating distance matrix
(Pi, P2)
(P4, Pe, P5, PS)
7
(Pi, P2)
a
457
| (P4, Po, PS, PS)
; BackkBenchers
Publications
Page 95 of 102
icatl
Scanned by CamScanner
CHAP - 8: REINFORCEMENT LEARNING

Q1.  What are the elements of reinforcement learning?
Q2.  What is Reinforcement Learning? Explain with the help of an example.
Q3.  Explain reinforcement learning in detail along with the various elements involved in forming the concept. Also define what is meant by a partially observable state.

Ans:                                                          [5-10M | May16, Dec16, May17, Dec17 & May18]
REINFORCEMENT LEARNING:
1.   Reinforcement Learning is a type of machine learning algorithm.
2.   It enables an agent to learn in an interactive environment by trial and error.
3.   The agent uses feedback from its own actions and experiences.
4.   The goal is to find a suitable action model that increases the total reward of the agent.
5.   The output depends on the state of the current input.
6.   The next input depends on the output of the previous input.
7.   Types of Reinforcement Learning:
     a.  Positive RL.
     b.  Negative RL.
8.   The most common applications of reinforcement learning are:
     a.  PC Games: Reinforcement learning is widely used in PC games like Assassin's Creed, Chess, etc., where the enemies change their moves and approach based on your performance.
     b.  Robotics: Most of the robots that you see in the world at present are running on Reinforcement Learning.
EXAMPLE:
1.
We have anagent and a reward, with many hurdles in between.
2.
The agent
is supposed
to find the best possible path to reach the
on
‘
.
4
i
reward.
a <
oma)
Figure 8.1: Example of Reinforcement
Learning.
3.
The figure 8.15 hows robot, diamond
4.
The goal of the robot is to get the reward
5.
Reward is the diamond and avoid the hurdles that is fire.
6.
The robot learns by trying all the possible paths
7.
After learning robot chooses a
path which gives him the reward
with the least hurdles
8,
9.
and fire.
Each right step will give the robot a reward.
Each wrong step will subtract the reward of the robot,
reward will be calculat
it
10. The total
Sen
ed when it reaches the final reward
“Ward
wy Handcrafted by BackkBenchers Publications
ISS
;
d.
that is the diamon
Cn
102
Scanned by CamScanner
a
BI Reinforcement Learning
Topparssolutions60M
LEARNI
porcEMENTNT LEARNING
ELEMENTs:
Z
eee four Main Sub elements of 5
ew
www.ToppersSolutions.com
=
reinfor cementni
leaar
le
rninina system:
, pore,
r
,yeward function
Are
pyalue
function
A model of the envi ronment
IS )
(optional
|
>»
op a
z
|
>
a,
is
=
»
ig!
er
a
ig
4
ae
z
po
i
i
CBstERVATIONS
Policy:
The policy is th
core of a reinforcement learning
=p
Pobevis
;
Policy IS sufficient
Suiliciént to
to dearer
determine
3
ingeneral,
4
Apokcy defines the learning agent's way of behaving at a given time.
policies may
bs,
behaviour
be stochastic
she environment to actions to be taken when
in those
3.
=
a
5=e
w&
5
b»
9
{Q
oO
ay
perc
each
mans
function
Peward
=
o
0
Wi
of
Areward functuo
Qa.
tb
1
~
table
d bad events are for the agent.
The reward function defines wh
erable by the agent.
nec
must necesse
funct.o7m n must
rdd funct.o
T
tne policy.
Reward functions sere as a basis for altering
3
lil) Value Function:
a reward
Whereas
1
>
Avelue
5
The
function
value
funaction
starting
from
4
F
ds
5
Wathout
5 the
to
in an immediate sense,
in the fong run.
Species ' vfat 1ssua
of a state
is good
indicates what
amount
cf reward
an
agent
can
expect
to collect
over
the
future,
that state
are ina sense primary, whereas values. a5 p redicticns of rewards, are secondary.
rewards
there could be no
values, and the only purcose of estimating values
is to achieve
more reward
we are most concerned when making and evaluating decisions. Action choices
s.
are made based on value judgment
element of Reinforcement learning system
behaviour of the environment.
%
action, ¢ ne mode! might predict the resultant next state and next
€or example, given a state and
reward
4
nning.
plad
for e
Models are us
5. if model know current state @
e and
ion,
aciic
next reward.
Gre ase
predict the resuultant next state and
then it can
iit
a to
hers Publications
pZHanderafted by BackkBenc
Page 97 of 162
Scanned by CamScanner
Chap — 8 | Reinforcement Learning
Q4.
www. ToPPersSolutions co,
$$
Model Based Learning
Ans
ns:
MODEL
[om
BASED
LEARNING:
1.
arni
Model-based machine learning refers to machine learning
M odels.
2.
Those models are parameterized with a certain num ber of parameters which
do not change ag the
size of training data changes.
3.
|Deciy
:
.
The goals “to provide a single development framework which supports the creation of a wide range
of bespoke models",
4.
5.
‘
}
that a set of data {Xl, .., n, YI... MN} you are
given
subject toa linear
model Yi=sign(w'X, + b), Where w € R™ and m is the dimension of each data
point regardless of n,
For example,
if you
assume
There are 3 steps to model based machine learning,
namely:
a.
Describe the Model: Describe the process that generated the data using factor graphs.
b.
Condition on Observed Data: Condition the observed variables to their Known quantities.
Perform Inference: Perform
backward
reasoning to update the prior distribution over the laten:
variables or parameters.
6.   This framework emerged from an important convergence of three key ideas:
     a.  The adoption of a Bayesian viewpoint,
     b.  The use of factor graphs (a type of probabilistic graphical model), and
     c.  The application of fast, deterministic, efficient and approximate inference algorithms.

MODEL BASED LEARNING PROCESS:
1.   We start with model based learning where we completely know the environment.
2.   The environment parameters are P(rₜ₊₁ | sₜ, aₜ) and P(sₜ₊₁ | sₜ, aₜ).
3.   In such a case, we do not need any exploration.
4.   We can directly solve for the optimal value function and policy from the model parameters using dynamic programming.
5.   The optimal value function is unique and is the solution to a set of simultaneous equations.
6.   Once we have the optimal value function, the optimal policy is to choose the action that maximizes the value in the next state, as follows:

         π*(sₜ) = argmax_aₜ ( E[rₜ₊₁ | sₜ, aₜ] + γ Σ_(sₜ₊₁) P(sₜ₊₁ | sₜ, aₜ) V*(sₜ₊₁) )
BENEFITS:
1.   Provides a systematic process for creating ML solutions.
2.   Allows for the incorporation of prior knowledge.
3.   Does not suffer from overfitting.
4.   Separates the model from the inference/training code.
Q5.  Explain in detail Temporal Difference Learning.

Ans:                                                          [10M | Dec16, May17 & Dec18]

TEMPORAL DIFFERENCE LEARNING:
1.   It is an approach to learning how to predict a quantity that depends on future values of a given signal.
2.   Temporal Difference Learning methods can be used to estimate these value functions.
3.   Learning happens through the iterative correction of your estimated returns towards a more accurate target return.
4.   TD algorithms are often used to predict a measure of the total amount of reward expected over the future, but they can be used to predict other quantities as well.
5.   Different TD algorithms:
     a.  TD(0) algorithm
     b.  TD(1) algorithm
     c.  TD(λ) algorithm
6.   The easiest to understand temporal-difference algorithm is the TD(0) algorithm.
TEMPORAL DIFFERENCE LEARNING METHODS:

I)   On-Policy Temporal Difference methods:
1.   These learn the value of the policy that is used to make decisions.
2.   The value functions are updated using results from executing actions determined by some policy.
3.   These policies are usually "soft" and non-deterministic.
4.   This ensures there is always an element of exploration in the policy.
5.   The policy is not so strict that it always chooses the action that gives the most reward.
6.   On-policy algorithms cannot separate exploration from control.
II)  Off-Policy Temporal Difference methods:
1.   These can learn different policies for behaviour and estimation.
2.   Again, the behaviour policy is usually "soft" so there is sufficient exploration going on.
3.   Off-policy algorithms can update the estimated value functions using actions which have not actually been tried.
4.   Off-policy algorithms can separate exploration from control.
5.   The agent may end up learning tactics that it did not necessarily exhibit during the learning phase.

ACTION SELECTION POLICIES:
The aim of these policies is to balance the trade-off between exploitation and exploration.

I)   ε-Greedy:
1.   Most of the time the action with the highest estimated reward is chosen, called the greediest action.
2.   Every once in a while, say with a small probability ε, an action is selected at random.
3.   The action is selected uniformly, independent of the action-value estimates.
4.   This method ensures that if enough trials are done, each action will be tried an infinite number of times.
5.   It ensures optimal actions are discovered.

II)  ε-Soft:
1.   Very similar to ε-greedy.
2.   The best action is selected with probability 1 - ε.
3.   The rest of the time a random action is chosen uniformly.

III) Softmax:
1.   One drawback of ε-greedy and ε-soft is that they select random actions uniformly.
2.   The worst possible action is just as likely to be selected as the second best.
3.   Softmax remedies this by assigning a rank or weight to each of the actions, according to their action-value estimate.
4.   A random action is selected with regard to the weight associated with each action.
5.   This means that the worst actions are unlikely to be chosen.
6.   This is a good approach to take where the worst actions are very unfavourable.
1
Gorn't need a model of the environment.
2.
Grr-line and incremental so can be fast.
|
%
Theydon't
|
4
Needless memory and computation
Of.
need to wait till the end of the episode
— Mriat is Q-learning? Explain algorithm for learning Q
Ans:
[10M | Deci8}
O-LEAPRNING:
1
Grlearning
4
Q-learning is a values-based learning alyorithrn.
is a reinforcernent
learning technique
used
in machine
learning.
fre goal of O-learning is to learn a policy,
4
Glearung
,
&
tells an agent what action to take under what circumstances
Orlearring uses temporal differences to estimate the value of Q'(s, a).
In G learning, the agent maintains a table of Q[SA],
YVinere, Sis the set of states and
A is the set of actions.
7
Q [s, a} represents its current estimate of Q* (s, a).
Ae experience (s, a, 1,5') provides one data point
for the value of ls, a)
&
The data point is that the agent received the future value of rt yV(s4j,
Vihere Vis}= max,’ Ofs', a’)
&
ky
ii.
Ths
the actual current reward plus the discounted estimated
future value,
This nev data point is called a return.
The agent can use the temporal difference equation to upd ate its est
imate for Qls, a):
Ofs, a] « Ofs, a} + afr yrnaxa’ Ofs', a'] - Ofs, al)
12
Ot equivalently,
Ofs, a} ¢ Mo) Ofs, a] + alr yrmaxa' Qfs', a’]).
Scanned by CamScanner
~
EE
nae - 8 | Reinforcement Learning
ESSN
ee
Www. Tapparssolutions.com
g iecan be e p proven that given
i
fol
imi
sufficient
training
under any» - soft policy, the algorithin converges with
ability] to a close
j
i
neil
prob
¥
approximation of the action-yalue function foran arbitrary target policy
:
ing le learns
i
'
4, Q-Learning
the optimal
policy even when actions are selected according to a mure exploratory
or even random policy.
‘0°
E:
|
Q-Table is just a simple lookup table
2
In Q-Table we calculate the maximum
4
Basically, this table will guide us to the best action at each state,
In the Q-Table, the columns are the actions and the rows are the states,
5.
§
expected
future rewards for action at each state,
Each Q-table score will be the maxirnum expected future reward that the agent will get if it takes that
action at that state,
6
This isan iterative process, as we need to improve the Q-Table at each iteration.
PARAMETERS USED IN THE Q-VALUE UPDATE PROCESS:

I)   α - the learning rate:
1.   Set between 0 and 1.
2.   Setting it to 0 means that the Q-values are never updated, hence nothing is learned.
3.   Setting a high value such as 0.9 means that learning can occur quickly.

II)  γ - the discount factor:
1.   Set between 0 and 1.
2.   This models the fact that future rewards are worth less than immediate rewards.
3.   Mathematically, the discount factor needs to be set less than 1 for the algorithm to converge.

III) maxₐ' - the maximum reward:
1.   This is the maximum reward attainable in the state following the current one.
2.   I.e. the reward for taking the optimal action thereafter.

PROCEDURAL APPROACH:
1.   Initialize the Q-values table, Q(s, a).
2.   Observe the current state, s.
3.   Choose an action, a, for that state based on one of the action selection policies (ε-soft, ε-greedy or Softmax).
4.   Take the action, and observe the reward, r, as well as the new state, s'.
5.   Update the Q-value for the state using the observed reward and the maximum reward possible for the next state. (The updating is done according to the formula and parameters described above.)
6.   Set the state to the new state, and repeat the process until a terminal state is reached.
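A minimal sketch of this procedure on a toy problem (Python with numpy; the environment step function, the number of states and the reward scheme are illustrative assumptions, not from the text):

    import numpy as np

    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))              # step 1: initialize the Q-table
    alpha, gamma, epsilon = 0.5, 0.9, 0.1
    rng = np.random.default_rng(0)

    def env_step(state, action):
        """Hypothetical environment: moving right (action 1) into the last state pays 1."""
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if (state == n_states - 2 and action == 1) else 0.0
        done = next_state == n_states - 1
        return next_state, reward, done

    for _ in range(200):                             # episodes
        s = 0                                        # step 2: observe the current state
        while True:
            # step 3: epsilon-greedy action selection
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env_step(s, a)         # step 4: take the action, observe r and s'
            # step 5: temporal-difference update of Q[s, a]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next                               # step 6: move to the new state
            if done:
                break

    print(Q.round(2))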
Q7.  Explain the following terms with respect to Reinforcement learning: delayed rewards, exploration, and partially observable states.
Q8.  Explain reinforcement learning in detail along with the various elements involved in forming the concept. Also define what is meant by a partially observable state.

Ans:                                                          [10M | Dec17 & Dec18]
EXPLORATION:
1.   It means gathering more information about the problem.
2.   Reinforcement learning requires clever exploration mechanisms.
3.   Randomly selecting actions, without reference to an estimated probability distribution, shows poor performance.
4.   The case of (small) finite Markov decision processes is relatively well understood.
5.   However, due to the lack of algorithms that properly scale well with the number of states, simple exploration methods are the most practical.
6.   One such method is ε-greedy, where the agent chooses the action that it believes has the best long-term effect with probability 1 - ε.
7.   If no action which satisfies this condition is found, the agent chooses an action uniformly at random.
8.   Here, 0 < ε < 1 is a tuning parameter, which is sometimes changed, either according to a fixed schedule or adaptively based on heuristics.
DELAYED REWARDS:
1.   In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward, but also (at least probabilistically) the next state of the environment.
2.   It may take a long sequence of actions, receiving insignificant reinforcement, to finally arrive at a state with high reinforcement.
3.   The agent has to learn from delayed reinforcement; this is called delayed rewards.
4.   The agent must be able to learn which of its actions are desirable based on a reward that can take place arbitrarily far in the future.
5.   It can also be done with eligibility traces, which weight the previous action a lot, the action before that a little less, the action before that even less, and so on.
6.   But this takes a lot of computational time.
PARTIALLY OBSERVABLE STATES:
1.   In certain applications, the agent does not know the state exactly.
2.   It is equipped with sensors that return an observation, using which the agent should estimate the state.
3.   For example, suppose we have a robot which navigates in a room.
4.   The robot may not know its exact location in the room, or what else is in the room.
5.   The robot may have a camera with which sensory observations are recorded.
6.   This does not tell the robot its state exactly, but gives an indication as to its likely state.
7.   For example, the robot may only know that there is a wall to its right.
8.   The setting is like a Markov decision process, except that after taking an action aₜ, the new state sₜ₊₁ is not known, but we have an observation oₜ₊₁ which is a stochastic function of sₜ and aₜ: p(oₜ₊₁ | sₜ, aₜ).
9.   This is called a partially observable state.
“Education is Free.,.. But its Technology used & Efforts utilized which we
charge”
it takes lot of efforts for searching out each & every question and transforming it into Short & Simple
Language. Entire Topper’s Solutions Team is working out for betterment of students, do help us.
“Say No to Photocopy....”
With Regards,
Topper’s Solutions Team.